IMPROVEMENT OF THE
NARROWBAND  LINEAR  PREDICTIVE  CODER 

PART  2-SYNTHESIS  IMPROVEMENTS 

INTRODUCTION 

For  many  years  the  linear  predictive  coder  (LPC)  has  been  used  to  convert  speech  into  digital 
form for secure voice transmission over narrowband channels at low bit rates (less than 5% of the original
speech transmission rate). The Navy, as a prime user of narrowband channels for voice communications,
has played a significant role in the research and development of LPCs. In 1973 the Navy produced
one of the first narrowband LPCs capable of operating in real time. Since 1978 the Navy has
been  the  Department  of  Defense’s  (DoD’s)  technical  agent  for  the  development  of  LPCs  intended  for 
triservice  tactical  use. 

Previously  [1],  we  presented  our  efforts  on  LPC  analysis  improvements.  The  objective  of  that 
investigation  was  to  improve  the  narrowband  LPC  performance  by  modifying  the  LPC  analysis  without 
increasing  the  data  rate  (2400  bits  per  second  (b/s))  and  without  violating  the  interoperability 
requirements— such  as  the  speech  sampling  rate  and  the  parameter  encoding  format— currently  adopted 
by  DoD.  We  chose  to  work  within  the  confines  of  these  interoperability  requirements  because  they  will 
soon  be  established  as  the  military  standard  (MIL-STD-188-113)  or  the  federal  standard  (FED-STD- 
1015), and it was hoped that our efforts could benefit the narrowband LPC currently under development
for DoD use.

In  this  report  we  present  our  efforts  on  LPC  synthesis  improvements  as  the  second  part  of  this 
two-part  series.  The  objective  of  this  investigation  is  to  improve  the  narrowband  LPC  performance  by 
modifying  the  LPC  synthesis  by  using  only  the  data  transmitted  by  the  standard  DoD  narrowband  LPC. 

OVERVIEW  OF  OUR  LPC  SYNTHESIS  IMPROVEMENTS 

Figure 1 shows that the narrowband LPC synthesizer has three functional blocks: (a) the synthesis
filter, (b) the excitation signal generator, and (c) the postsynthesis processor. As we discuss
later, the excitation signal generator and the postsynthesis processing are the weakest links in the narrowband
LPC synthesizer; we therefore concentrate on these two areas in this report. Three of the
four improvements presented involve the excitation signal; the remaining one involves the postsynthesis
processing. We do not present any items related to improvement of the synthesis filter because it
is basically constrained by the DoD interoperability requirements. The following is an overview of the
four improvements discussed in this report.

Amplitude Spectrum Shaping of the Voiced Excitation Signal

The  conventional  excitation  signal  used  to  generate  voiced  speech  is  simply  an  impulse  waveform 
(or  any  other  fixed  waveform  with  a  flat  amplitude  spectrum)  which  is  repeated  at  the  pitch  rate.  The 
use  of  such  an  excitation  would  be  logical  if  the  LPC  analysis  filter  completely  removed  speech 
resonant  frequency  components  so  that  the  prediction  residual  had  a  flat  amplitude  spectral  envelope. 
In actuality, the prediction residual retains a considerable amount of speech resonant frequency components
because of limitations inherent in the linear predictive analysis (i.e., the all-pole modeling of
the speech and the use of a limited number of filter weights). Therefore, to generate more natural-sounding
speech, the narrowband LPC excitation signal should contain resonant frequencies similar to

Manuscript  approved  December  27, 1983. 




KANG  AND  EVERETT 


[Figure 1 block labels: excitation signal generator (unvoiced random-noise excitation signal source; quasi-periodic voiced excitation signal source); LPC synthesis filter; postsynthesis process; speech out. Inputs from the received LPC parameters: voicing and filter weights.]

Fig. 1 - Block diagram of the narrowband LPC synthesizer. The shaded blocks
are those items we have modified as discussed in this report.

those  in  the  prediction  residual.  We  present  a  way  of  introducing  these  resonant  frequencies  into  the 
conventional  narrowband  excitation  signal  for  voiced  speech.  The  amplitude  spectrum  shaping  of  the 
voiced excitation signal produced a 5.2-point improvement in the speech quality as evaluated by the
Diagnostic Acceptability Measure (DAM) [2]. This indicates that the resulting speech quality is comparable
to that of a voice processor operating at 9600 b/s, or four times the data rate of the narrowband
LPC. 

Phase  Spectrum  Shaping  of  the  Voiced  Excitation  Signal 

The individual waveform of the conventional voiced excitation signal repeats exactly from one
pitch cycle to the next. In contrast, the prediction residual rarely repeats exactly from one pitch cycle
to the next. This is due to irregularities in vocal cord movement and turbulent air flow from the lungs
during the glottis-open period of each pitch cycle. The extreme regularity of the LPC excitation signal
causes the synthesized speech to sound machinelike and tense. To reduce this effect, pitch epoch variations
and period-to-period waveform variations may be conveniently realized by introducing phase jitter
in  the  waveform.  We  present  a  new  expression  for  the  voiced  excitation  signal  and  specify  the  phase 
jitter  characteristics.  Use  of  this  phase  spectrum  shaping  in  the  voiced  excitation  signal  increased 
overall  quality  DAM  scores  by  4.7  points  for  male  speakers  and  5.0  points  for  female  speakers. 

Modified  Unvoiced  Excitation  Signal 

The  conventional  excitation  signal  for  generating  unvoiced  speech  is  simply  random  noise  with  a 
uniform  or  Gaussian  amplitude  distribution.  Such  an  excitation  produces  satisfactory  nonabrupt 
unvoiced  sounds,  or  continuants,  such  as  /f/,  /s/,  /sh/,  and  /th/.  As  expected,  the  prediction  residuals 
for  these  sounds  are  random,  with  an  approximately  Gaussian  amplitude  distribution.  On  the  other 
hand,  the  prediction  residuals  for  abrupt  consonants  such  as  /k/,  /t/,  and  /ch/  are  spiky  and  irregular, 
especially  in  the  burst  or  onset  portion  of  the  sound.  Therefore  the  satisfactory  production  of  these 
sounds  requires  an  excitation  signal  consisting  of  random  noise  with  at  least  one  large  spike  at  the 
onset.  Without  this  large  spike,  a  synthesized  stop  consonant  usually  sounds  more  like  a  continuant. 

We present a new form of the unvoiced excitation signal. Although similar to the conventional
unvoiced excitation for the generation of nonabrupt unvoiced sounds, our excitation signal generates
randomly spaced spikes if the speech root-mean-square (RMS) value changes sharply from one
unvoiced  frame  to  another.  This  modified  unvoiced  excitation  signal  enhances  the  reproduction  of 
unvoiced  plosives  without  affecting  the  reproduction  of  nonabrupt  unvoiced  sounds. 
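
The idea of adding spikes only when the frame RMS changes sharply can be illustrated with a minimal sketch. This is not the procedure used in this report: the function name `unvoiced_excitation`, the threshold `jump_ratio`, the spike count `n_spikes`, the spike amplitude `spike_gain`, and the choice to react only to an RMS increase are all assumptions made for illustration.

```python
import random

def unvoiced_excitation(n_samples, rms_prev, rms_curr, rng,
                        jump_ratio=4.0, n_spikes=2, spike_gain=6.0):
    """Uniform noise, plus a few randomly placed spikes when the frame RMS jumps.

    jump_ratio, n_spikes, and spike_gain are hypothetical tuning values.
    """
    # Conventional part: uniformly distributed random noise in [-1, 1].
    e = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    # Assumed onset test: current frame RMS much larger than the previous one.
    abrupt = rms_curr > jump_ratio * max(rms_prev, 1e-12)
    if abrupt:
        # Overwrite a few randomly chosen samples with large spikes of random sign.
        for pos in rng.sample(range(n_samples), n_spikes):
            e[pos] = spike_gain * rng.choice((-1.0, 1.0))
    return e

rng = random.Random(1)
steady = unvoiced_excitation(180, rms_prev=100.0, rms_curr=110.0, rng=rng)
burst = unvoiced_excitation(180, rms_prev=10.0, rms_curr=200.0, rng=rng)
```

With a steady RMS the output is plain noise, so continuants are unaffected; with a sharp RMS rise the same noise carries a small number of large spikes, which is the property attributed above to the onset of abrupt consonants.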

The  use  of  the  modified  unvoiced  excitation  signal  improved  the  overall  Diagnostic  Rhyme  Test 
(DRT)  [3]  score  of  the  LPC  by  3.6  points  for  three  female  speakers.  Significantly,  the  partial  score  for 
discriminating  abrupt  vs  nonabrupt  unvoiced  sounds  was  improved  by  14.4  points,  implying  that  we 
have  properly  identified  a  major  weakness  in  the  unvoiced  excitation  signal  and  generated  a  solution  to 
correct  it. 

Expanded  Output  Bandwidth 

Contrary  to  convention,  the  output  bandwidth  of  a  voice  processor  need  not  be  the  same  as  the 
input  bandwidth.  According  to  our  experimentation,  synthesized  speech  is  much  brighter  and  often 
more  intelligible  when  the  output  bandwidth  is  made  greater  than  the  input  bandwidth.  To  accomplish 
this in the narrowband LPC without altering the data rate, we folded the frequency contents of synthesized
speech between 2 and 4 kHz upward at 4 kHz to make an output bandwidth of 6 kHz, rather
than  the  usual  4  kHz.  This  results  in  more  natural  fricative  sounds  and  sharper  stop  consonants. 
Although  this  also  generates  weak  extraneous  formants  in  the  upperband  regions  of  voiced  speech 
sounds,  it  does  not  affect  their  intelligibility,  and  in  fact  adds  brightness  to  their  tonal  quality.  Test 
results  show  that  the  extended  output  bandwidth  produces  a  2.5-point  increase  in  overall  quality  as 
measured  by  the  DAM. 
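
One way to realize such an upward fold, sketched here under stated assumptions, is to insert a zero after every 8-kHz sample (which mirrors the 0 to 4 kHz band into 4 to 8 kHz at a 16-kHz rate) and then discard everything above 6 kHz. The report does not state how the fold was implemented, so the zero-insertion approach, the brick-wall frequency-domain cutoff, and the names `dft`, `idft_real`, and `expand_bandwidth` are illustrative choices only.

```python
import cmath
import math

def dft(x):
    """Plain O(n^2) discrete Fourier transform (adequate for short test signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft_real(X):
    """Inverse DFT, returning the real part of each sample."""
    n = len(X)
    return [sum(X[f] * cmath.exp(2j * math.pi * f * t / n) for f in range(n)).real / n
            for t in range(n)]

def expand_bandwidth(x, fs_in=8000.0, cutoff_hz=6000.0):
    """Zero-insert to double the sampling rate, then keep only |f| <= cutoff_hz."""
    y = []
    for sample in x:
        y.extend((sample, 0.0))   # the 0-4 kHz band now also appears mirrored in 4-8 kHz
    fs_out = 2.0 * fs_in
    Y = dft(y)
    n = len(Y)
    for f in range(n):
        freq = min(f, n - f) * fs_out / n   # absolute frequency of bin f
        if freq > cutoff_hz:
            Y[f] = 0.0                      # remove the mirrored 0-2 kHz content
    return idft_real(Y), fs_out

# A 3 kHz tone sampled at 8 kHz gains a mirrored component at 5 kHz.
tone = [math.cos(2.0 * math.pi * 3000.0 * t / 8000.0) for t in range(64)]
wide, fs_out = expand_bandwidth(tone)
```

In this sketch a component at frequency f between 2 and 4 kHz reappears at 8 kHz minus f, that is, between 4 and 6 kHz, while components below 2 kHz are left without an image because their mirror falls above the 6-kHz cutoff.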


BACKGROUND 

Over  the  years  numerous  voice  processors  have  been  developed  for  operational  use,  including 
pulse  code  modulators  (PCM)  at  18.75  and  50  kilobits  per  second  (kb/s),  continuously  variable  slope 
delta  (CVSD)  modulators  at  16  and  32  kb/s,  adaptive  predictive  coders  (APC)  at  6.4  and  9.6  kb/s,  and 
the  narrowband  LPC  and  a  channel  vocoder  at  2.4  kb/s.  Today  the  most  commonly  used  data  rates  are 
2.4,  9.6,  and  16  kb/s. 

The narrowband LPC operating at 2.4 kb/s is becoming a vital part of the DoD voice communication
system because it can provide adequate communicability in less than favorable operational environments.
For example, it can transmit speech over narrowband channels with a bandwidth of approximately
3 kHz, such as high frequency (HF) channels, unequalized telephone lines, or fieldwires.
Transmission  over  HF  channels,  which  the  Navy  often  relies  on,  requires  a  simple  low-power 
transmitter  operable  in  shipboard,  airborne,  shelter,  and  vehicular  platforms. 

The  narrowband  LPC  can  also  transmit  speech  more  reliably  over  the  Navy  FLEETSATCOM 
channels  than  can  higher  data  rate  voice  processors.  Because  the  fixed  power  at  the  satellite  relay 
makes  the  signal-to-noise  ratio  at  the  receiver  inversely  proportional  to  the  data  rate,  the  low  data  rate 
of  the  2.4  kb/s  LPC  provides  a  less  noisy  speech  signal. 
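
As a rough worked example, suppose the signal-to-noise ratio is taken to be exactly inversely proportional to the data rate, with everything else held fixed (an idealization of the statement above; the symbols below are introduced only for this illustration). The advantage of 2.4 kb/s over a higher rate R is then

$$\Delta_{\mathrm{SNR}} = 10 \log_{10}\!\left(\frac{R}{2.4\ \text{kb/s}}\right)\ \text{dB},$$

which gives

$$10 \log_{10}\!\left(\frac{9.6}{2.4}\right) \approx 6.0\ \text{dB}, \qquad 10 \log_{10}\!\left(\frac{16}{2.4}\right) \approx 8.2\ \text{dB}$$

for the 9.6 and 16 kb/s rates mentioned earlier.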

Furthermore,  the  narrowband  LPC  provides  better  survivability  in  the  presence  of  man-made  or 
natural  disturbances  in  the  transmission  channel  since  there  are  more  narrowband  channels  available  for 
rerouting  (such  as  public  and  DoD  telephone  lines).  In  addition,  the  2.4  kb/s  narrowband  LPC  actually 
yields higher intelligibility scores than some higher rate voice processors in certain high-noise environments.
For example, in a shipboard platform the average DRT score for the narrowband LPC is 87.2,
whereas  it  is  only  80.0  for  the  9.6  kb/s  APC. 

Because of these advantages, the use of the narrowband LPC is expected to become more
widespread in the future. Although the narrowband LPC may outperform higher rate voice processors
in less favorable operational conditions, it is still inferior when operated in a quiet environment. In
general, the intelligibility of narrowband LPC speech is moderately good. The average overall DRT
scores are about 89 for male talkers and about 86 for female talkers, which compare favorably with
those  of  the  9.6  kb/s  APC  (91  for  both  male  and  female  talkers).  However,  the  speech  quality  of  the 
LPC  is  notoriously  poor.  For  example,  the  Composite  Acceptability  Estimate  (CAE)  of  the  Diagnostic 
Acceptability  Measure  (DAM)  for  the  narrowband  LPC  is  about  6  points  lower  than  that  of  the  APC 
for  male  talkers,  and  9  points  lower  for  female  talkers. 

Weaknesses  of  the  Narrowband  LPC  Synthesizer 

The  synthesis  procedure  in  the  narrowband  LPC  is  partly  to  blame  for  the  deficiency  in  speech 
quality  mentioned  above  because  the  model  used  to  generate  the  speech  is  simple  and  unrealistic.  The 
narrowband  LPC  excitation  signal  is  based  on  the  assumption  that  all  speech  can  be  generated  by  using 
either  a  purely  periodic  (voiced)  excitation,  or  a  purely  random  (unvoiced)  excitation.  The  weakness  of 
this  model  becomes  evident  when  it  is  compared  with  the  prediction  residual  representing  the  ideal 
excitation signal for the LPC analysis/synthesis system. The prediction residual, unlike the narrowband
LPC excitation signal, is not always periodic, even when the input speech is a sustained vowel. Likewise,
the prediction residual is not always random when the input speech is unvoiced. Most importantly,
the prediction residual is a sample-by-sample quantity that cannot be closely approximated by a
signal which is regenerated by using a limited number of frame-by-frame parameters, as is the case with
the narrowband LPC excitation signal.

One  way  of  improving  the  excitation  signal  would  be  to  transmit  the  prediction  residual  itself,  as 
in the APC or the Navy Multirate Processor (MRP) [4]. However, to do this requires a data rate of at
least 9.6 kb/s. Another way to improve the excitation signal would be to create a multipulse signal to
minimize the perceptual difference between the unprocessed and the synthetic speech [5]. Still, the
required  data  rate  is  well  in  excess  of  2.4  kb/s. 

Because  any  improvements  to  the  narrowband  LPC  must  be  interoperable  with  the  standard  DoD 
narrowband  LPC,  we  do  not  propose  to  use  a  radically  different  excitation  signal.  We  do,  however, 
propose to use a more general form of the excitation signal source from which either the voiced or the
unvoiced excitation signal, or a hybrid signal resembling both, may be generated. This modified excitation
signal source has more control variables than the conventional source, allowing more freedom in
specifying  its  characteristics. 

Modified  Excitation  Signal  Source 

The  conventional  excitation  signal  is  divided  into  two  mutually  exclusive  parts:  a  broadband 
repetitive  signal  to  generate  voiced  speech  and  a  broadband  random  signal  to  generate  unvoiced  speech. 
The  choice  between  the  two  excitation  signals  is  determined  by  the  (binary)  voicing  decision;  the 
repetitive  rate  of  the  voiced  excitation  signal  is  governed  by  the  pitch  frequency. 

In contrast, our modified excitation signal is not rigidly divided into two classes: the voiced excitation
signal contains some random components, and, likewise, the unvoiced excitation signal contains
some deterministic components. This hybrid form of excitation signal is much closer to the actual voicing
excitation than is the conventional divided signal. As we show, the presence of these complementary
components improves the naturalness and quality of the synthesized speech.

In essence, the conventional excitation signal is a stationary model of our excitation signal. The
conventional signal is generated under the assumptions that (a) the amplitude spectrum is flat and
time-invariant, (b) the phase spectrum of the voiced excitation signal is a time-invariant function of
frequency, and (c) the phase spectrum of the unvoiced excitation signal has a probability function that
is time invariant. These assumptions make it possible to generate a replica of the voiced excitation signal
which can be stored in memory and read out sequentially at every voiced pitch epoch. Similarly,
unvoiced  excitation  signal  samples  are  read  out  randomly  from  a  table  containing  uniformly  distributed 
random  numbers. 

In our modified excitation signal we do not use "canned" samples with invariant characteristics.
Instead we generate new excitation signal samples at each pitch epoch, or at a fixed time interval if the
speech is unvoiced, based on the updated amplitude and phase spectra of the excitation. This excitation
signal is based on the Fourier series; thus the ith excitation sample e(i) is given by

$$e(i) = \sum_{k=1}^{K} a(k) \cos\left[\frac{2\pi (k-1)}{I}\, i + \phi(k)\right], \qquad 1 \le i \le I, \tag{1}$$

where a(k) and φ(k) are the kth amplitude and phase spectral components, respectively, I is the
number of excitation signal samples, and K is the number of amplitude or phase spectral components.
The quantity K is related to I by

$$K = \begin{cases} \dfrac{I}{2} + 1 & \text{if } I \text{ is even} \\[4pt] \dfrac{I+1}{2} & \text{if } I \text{ is odd.} \end{cases} \tag{2}$$
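
A Fourier-series excitation of this kind can be sketched in a few lines. The sketch assumes that the K components run from DC up to the highest harmonic that fits in I samples, so that the kth component (counting from 1) has frequency index k-1; that indexing, and the function names `num_components` and `excitation`, are assumptions made for illustration rather than details confirmed by the text.

```python
import math

def num_components(I):
    """Number of spectral components K for I samples: I/2 + 1 if even, (I + 1)/2 if odd."""
    return I // 2 + 1 if I % 2 == 0 else (I + 1) // 2

def excitation(a, phi, I):
    """Return [e(1), ..., e(I)] as a cosine sum over K components.

    a[k] and phi[k] are 0-based here, so list index k is the frequency index
    (the component written k+1 in a 1-based formula).
    """
    K = num_components(I)
    if len(a) != K or len(phi) != K:
        raise ValueError("need exactly K amplitude and K phase components")
    return [
        sum(a[k] * math.cos(2.0 * math.pi * k * i / I + phi[k]) for k in range(K))
        for i in range(1, I + 1)
    ]

# Example: a single harmonic at frequency index 1 yields one cosine cycle per period.
one_cycle = excitation([0.0, 1.0, 0.0, 0.0, 0.0], [0.0] * 5, 8)
```

Because a new set of amplitudes and phases can be supplied for every pitch period, the same routine covers a time-invariant flat spectrum as well as spectra that are updated pitch-synchronously.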


Equation  (1)  is  the  most  general  form  of  the  excitation  signal.  It  represents  the  excitation  signal 
not  only  for  the  narrowband  LPC,  but  also  for  the  wideband  LPC  as  in  the  previously  mentioned  Navy 
MRP [4]. In the MRP, the quantity I in Eq. (1) is the frame width, and both the amplitude and phase
spectral components, a(k) and φ(k), are derived from the actual prediction residual. Thus, the resulting
speech quality (at 16 kb/s) is excellent.

The  conventional  narrowband  LPC  excitation  signal  may  also  be  expressed  by  Eq.  (1).  In  this 
representation, the voicing decision is mapped onto the phase spectrum. Thus, the conventional excitation
signal in the form of Eq. (1) has two different phase spectra since it is controlled by a two-state
voicing decision. Table 1 gives the general characteristics of these two types of phase spectra. As we
will show, these correspond to the stationary parts of the phase spectrum of our modified excitation signal
for the respective voicing modes. The amplitude spectrum is, of course, flat and time invariant.

Our  modified  excitation  signal  will  have  spectral  properties  as  described  in  Table  1.  The  methods 
for  generating  these  characteristics  and  the  rationale  behind  them  are  discussed  in  a  subsequent  section 
of this report.

The duration of the narrowband LPC excitation signal is denoted by I in Eq. (1). If the speech is
voiced, the quantity I corresponds to the length of the pitch period as received by the synthesizer. If
the  speech  is  unvoiced,  there  is  by  definition  no  pitch  period,  so  we  assign  a  fixed  time  interval,  similar 
to  a  pitch  period,  to  periodically  renew  the  unvoiced  excitation  signal  and  to  periodically  interpolate  the 
LPC  parameters. 

The  unvoiced  excitation  signal  is  dispersed  over  the  entire  time  interval  because  its  phase  spectral 
components are randomly distributed (see Table 1). However, this is not the case with the voiced excitation
signal. For example, if we assume that the amplitude spectrum is flat and the phase spectrum is
a  linear  function  of  frequency,  then  the  resulting  voiced  excitation  signal  is  an  impulse,  meaning  that 

Table 1 - Summary of Narrowband LPC Excitation Signal Parameters

| Parameter | Conventional Narrowband LPC Excitation Signal | Our Modified Narrowband LPC Excitation Signal |
|---|---|---|
| Amplitude spectrum a(k) | Frequency-independent and time-invariant (assigned parameter) | With weak resonant frequencies, updated pitch-synchronously |
| Phase spectrum φ(k), voiced speech | A nonlinear function of frequency, and time-invariant (assigned parameter) | A quadratic function of frequency, with frequency-dependent phase jitters |
| Phase spectrum φ(k), unvoiced speech | N/A* | A stationary random process with a uniform distribution between -π and π radians, superimposed by amplitude-weighted, randomly spaced pulses |
| Signal duration (I) | Pitch period (received parameter) | Pitch period (received parameter) |

*Most commonly, the conventional unvoiced excitation signal is read out randomly from a table containing uniformly
distributed random numbers. Its phase spectrum cannot be expressed conveniently in terms of Eq. (1).


only one out of I excitation samples is nonzero. The spread of the voiced excitation signal is dependent
on  the  phase  spectrum.  We  present  a  preferred  phase  spectrum  for  the  voiced  excitation  signal  in  a 
later  section  of  this  report. 
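
The contrast between a linear phase spectrum (a single impulse) and a random phase spectrum (a dispersed signal) can be checked numerically. The sketch below assumes a cosine-sum excitation with components from DC to the Nyquist frequency; the period `I = 40`, the pulse position `delay = 5`, and the helper names are hypothetical. For the one-sided cosine sum to equal a truly flat two-sided spectrum, the DC and Nyquist terms are given half weight here, which is a choice made for this illustration.

```python
import math
import random

def excitation(a, phi, I):
    """e(i) = sum_k a[k] * cos(2*pi*k*i/I + phi[k]) for i = 1..I (k is 0-based)."""
    return [
        sum(a[k] * math.cos(2.0 * math.pi * k * i / I + phi[k]) for k in range(len(a)))
        for i in range(1, I + 1)
    ]

def crest_factor(e):
    """Peak magnitude divided by RMS value; largest when the energy sits in one sample."""
    rms = math.sqrt(sum(v * v for v in e) / len(e))
    return max(abs(v) for v in e) / rms

I = 40                                   # hypothetical period length in samples
K = I // 2 + 1                           # DC through Nyquist for even I
a = [0.5] + [1.0] * (K - 2) + [0.5]      # flat amplitude, end terms halved
delay = 5                                # hypothetical pulse position
linear_phase = [-2.0 * math.pi * k * delay / I for k in range(K)]
rng = random.Random(0)
random_phase = [rng.uniform(-math.pi, math.pi) for _ in range(K)]

pulse = excitation(a, linear_phase, I)        # all energy in one sample
noise_like = excitation(a, random_phase, I)   # energy spread over the interval
nonzero = [i + 1 for i, v in enumerate(pulse) if abs(v) > 1e-9]
```

With the linear phase, `nonzero` contains only the sample index equal to `delay`; with the random phase, the crest factor drops because the same spectral amplitudes are spread across the whole interval.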


Test  and  Evaluation  of  Synthesized  Speech 

There is no "speech quality meter" that automatically indicates the quality of synthetic
speech, and tests using known quality evaluation methods, such as the DAM test, are time-consuming, particularly
when the processor does not run in real time. For this reason, researchers often perform so-called
"informal listening tests." This method can indicate speech quality when done by using naive
listeners,  but  such  tests  can  be  rather  misleading  when  the  researchers  themselves  act  as  listeners 
because  their  ears  have  been  conditioned  to  the  electronic  accents  of  their  own  voice  processors. 
Furthermore,  the  aspect  of  speech  they  are  trying  to  improve  may  be  easily  heard  by  the  researchers 
but  imperceptible  to  casual  or  untrained  listeners.  Therefore,  it  is  essential  to  use  established  test 
methods  for  quality  evaluation. 


However,  quality  evaluation  using  established  methods  is  not  all  that  is  needed;  one  must  check 
carefully  to  be  sure  that  a  change  in  one  aspect  of  the  voice  processing  does  not  degrade  another  area. 

For example, filtering out the synthesized speech components below approximately 250 Hz produces a
more spectrally balanced sound for the narrowband LPC. Many listeners prefer this because the
absence of a heavy bass component makes the upper frequency contents more noticeable and intelligible.
However, such an alteration must be tested for potentially adverse effects on pitch and voicing
estimation when the LPC is operated in tandem with another narrowband LPC. Likewise, any
modification  to  one  aspect  of  the  speech  must  be  tested  for  effects  on  other  aspects.  Frequently  an 
improvement  in  subjective  speech  quality  degrades  the  measured  speech  intelligibility. 

In this report we have chosen to use evaluation methods that are sensitive to the specific aspects
of speech we are trying to improve. For example, the Diagnostic Rhyme Test (DRT), which measures
the intelligibility of initial consonants, would not be the best method to use for evaluating the quality of
synthesized speech. A much better evaluation could be made by using a method such as the Diagnostic
Acceptability  Measure  (DAM)  that  is  specially  designed  to  be  sensitive  to  speech  quality. 

With the DAM, a system is rated by using 12 phonetically balanced 6-syllable sentences from
each  talker.  A  listener  hears  the  12  sentences  as  a  group,  and  then  rates  the  overall  voice  quality  on  21 
separate  rating  scales  which  describe  the  speech  quality,  the  background  noise,  and  the  total  effect  of 
the  voice  signal  (e.g.,  nasal,  unnatural,  crackling,  intelligible).  All  the  scales  are  combined  into  an 
overall  composite  score.  Also,  a  number  of  diagnostic  scales  related  to  the  perceptual  quality  of  the 
speech signal and the background noise (such as fluttering, muffled, hissy) are computed based on various
subsets of the test scales.

Both  the  DAM  and  the  DRT  use  standard  tape  recordings  and  are  scored  by  Dynastat,  Inc.  in 
Austin,  Texas,  which  maintains  a  stable  crew  of  trained  listeners.  In  this  way  we  may  compare  our 
results  with  those  obtained  at  different  times  by  other  researchers.  Because  these  tests  measure 
different  aspects  of  the  speech,  both  have  become  indispensable  tools  for  evaluating  the  quality  and 
intelligibility  of  voice  processing  systems  in  the  DoD  community. 

Past  Improvements  to  the  LPC  Synthesis 

It  has  been  nearly  a  decade  since  the  Navy  and  others  first  implemented  the  narrowband  LPC  for 
real-time  operation.  Since  then  there  have  been  many  improvements  related  to  the  narrowband  LPC 
synthesis.  The  current  DoD  standard  narrowband  LPC  has  incorporated  many  of  the  earlier  changes 
developed both by DoD scientists and by R&D firms for their DoD sponsors [6,7]. All these improvements
are supported by rational principles as outlined in their respective articles and reports. The
features  do  not  adversely  affect  other  aspects  of  the  narrowband  speech  and  we  recommend  them  for 
any  narrowband  voice  processor.  They  include  the  following: 

•  the  use  of  pitch-synchronous  parameter  interpolation  to  make  the  synthetic  speech  sound 
cleaner, 

•  fixed-power  excitation  and  postsynthesis  amplitude  calibration  to  enhance  computational 
accuracy, 

•  the  use  of  a  time-dispersed  voiced  excitation  signal  to  reduce  the  speech  dynamic  range 
and  improve  the  tandem  performance  with  a  continuously  variable  slope  delta  (CVSD) 
processor, 

•  the  use  of  the  speech  power,  rather  than  the  excitation  signal  power,  as  an  amplitude 
parameter  to  eliminate  speech  amplitude  variations  caused  by  transmission  errors  in  LPC 
coefficients,  and 


•  nonlinear interpolations of LPC coefficients and the amplitude parameter to highlight sudden
speech transitions and make them sound crisper.

Despite  all  these  improvements,  the  speech  quality  of  the  narrowband  LPC  is  still  somewhat  poor, 
and  the  intelligibility  of  female  voices  remains  lower  than  that  of  male  voices.  This  report  addresses 
improvements  in  these  areas. 

AMPLITUDE  SPECTRUM  SHAPING  OF  THE  VOICED  EXCITATION  SIGNAL 

The  amplitude  spectrum  of  the  synthesized  speech  is  the  product  of  the  amplitude  spectrum  of 
the excitation signal and the frequency response of the synthesis filter. Thus the quality of the synthesized
speech is directly dependent on both these factors. Our objective in this section is to determine
the best amplitude spectrum of the excitation signal to use in the narrowband LPC in an effort to
generate the highest quality synthetic speech without compromising the DoD interoperability requirements.


In  the  conventional  narrowband  LPC  the  amplitude  spectrum  of  the  excitation  signal  is  always 
flat, both for the voiced and the unvoiced excitations (i.e., a(k) is a nonnegative constant for all k in
Eq.  (1)).  However,  in  looking  at  the  prediction  residual  as  the  ideal  excitation  signal  for  the  LPC,  we 
notice  that  its  amplitude  spectrum  is  not  flat  at  all,  especially  for  voiced  speech. 

The  prediction  residual  for  voiced  speech  contains  a  considerable  number  of  resonant  frequency 
components,  similar  to  those  in  the  original  speech  but  lower  in  intensity  (Figs.  2(a)  and  2(b)).  The 
presence  of  these  resonant  frequencies  makes  the  prediction  residual  itself  highly  intelligible.  In  fact, 
an average DRT score of 83.5 was obtained by using only the prediction residual for a set of three male
speakers  (one  speaker  scored  as  high  as  87.0).  Without  similar  resonant  frequency  components  in  the 
excitation  signal,  the  synthesized  speech  tends  to  sound  fuzzy  and  somewhat  lacking  in  clarity. 


[Figure 2 panels (spectral plots for the utterance "We think walking is good exercise"); surviving panel titles: (b) Prediction residual (ideal excitation signal); (d) Our voiced excitation signal for narrowband LPC.]

Fig. 2 - Spectra of original speech and LPC excitation signals. The prediction residual contains
a considerable number of resonant frequency components unfiltered by the LPC analysis
filter; the conventional voiced excitation signal contains no resonant frequencies. Our voiced
excitation signal has weak traces of resonant frequencies similar to those of the prediction residual,
making the synthesized speech sound more natural.

Resonant Frequencies in the Prediction Residual


In  the  narrowband  LPC  the  task  of  the  linear  predictive  analysis  is  to  represent  the  talker’s  vocal 
tract  in  the  form  of  an  all-pole  filter.  The  transfer  function  of  the  LPC  analyzer  transforms  the  speech 
waveform to the prediction residual waveform. Thus the residual spectrum R(z), stated in terms of
the speech spectrum E(z), is

$$R(z) = \left[1 - \sum_{n=1}^{N} a_n z^{-n}\right] E(z). \tag{3}$$


The  spectral  envelope  of  the  residual  is  flat  (i.e.,  R  (z)  is  a  constant)  only  when  the  speech  spectral 
envelope is represented perfectly by the all-pole spectrum H(z) expressed by

$$H(z) = \frac{1}{1 - \sum_{n=1}^{N} a_n z^{-n}} \tag{4}$$

$$H(z) = \frac{1}{\prod_{n=1}^{N/2} \left(1 - z_n z^{-1}\right)\left(1 - z_n^{*} z^{-1}\right)}, \tag{5}$$

where H(z) is equal to the transfer function of the LPC synthesizer, a_n is the nth prediction
coefficient, and (z_n, z_n*) is a complex conjugate pair.
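
The relationship between the analysis filter, the residual, and the all-pole synthesizer can be illustrated in the time domain. The sketch assumes the common convention in which the prediction is a weighted sum of past samples and the residual is the speech minus that prediction; the function names, the two-pole example, and the coarse rounding used to mimic coefficient quantization are illustrative assumptions, not the DoD quantization rule.

```python
import math

def prediction_residual(speech, a):
    """r(i) = s(i) - sum_n a[n-1] * s(i-n), with zero history before the first sample."""
    r = []
    for i, s_i in enumerate(speech):
        pred = sum(a[n - 1] * speech[i - n] for n in range(1, len(a) + 1) if i - n >= 0)
        r.append(s_i - pred)
    return r

def synthesize(excitation, a):
    """All-pole synthesis: s(i) = e(i) + sum_n a[n-1] * s(i-n)."""
    s = []
    for i, e_i in enumerate(excitation):
        pred = sum(a[n - 1] * s[i - n] for n in range(1, len(a) + 1) if i - n >= 0)
        s.append(e_i + pred)
    return s

# One resonance: poles at radius 0.95 and angle pi/4 (a complex conjugate pair).
radius, angle = 0.95, math.pi / 4.0
a_exact = [2.0 * radius * math.cos(angle), -radius * radius]
impulse = [1.0] + [0.0] * 49
speech = synthesize(impulse, a_exact)

flat_residual = prediction_residual(speech, a_exact)    # recovers the impulse
a_coarse = [round(c, 1) for c in a_exact]               # crude stand-in for quantization
leaky_residual = prediction_residual(speech, a_coarse)  # retains traces of the resonance
```

When the analysis coefficients match the resonance exactly, the residual collapses back to the impulse; when they are perturbed, part of the resonance survives in the residual, which is the effect discussed in this section.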

Because of the complex nature of the speech spectrum, the residual spectral envelope R(z) is
rarely flat. This is caused in part by the presence of antiresonant components (zeros) in the speech
waveform which will not be greatly affected by the LPC analysis filter. Figure 2 illustrates that the
prediction residual also contains considerable resonant frequency components not removed by the
analysis filter. There are two major reasons for this. First, the magnitudes of the resonant peaks of an
all-pole  filter,  such  as  the  LPC  synthesis  filter,  are  dependent  on  the  pole  locations  (see  Eq.  (5));  they 
cannot  be  independently  controlled  as  they  can  in  a  parallel  formant  synthesizer.  In  other  words,  for  a 
given  set  of  pole  locations,  the  magnitudes  of  the  resonant  peaks  are  predetermined  and  cannot  be 
altered  without  actually  shifting  the  poles.  We  have  observed  that  the  formant  amplitudes  in  the  LPC 
synthesizer  are  often  lower  than  those  of  the  actual  speech.  The  greater  the  magnitude  of  the  original 
formants,  the  stronger  the  resonant  frequency  components  in  the  prediction  residual.  Therefore  a  voice 
with  unusually  intense  formant  frequencies  will  not  be  reproduced  well  by  the  narrowband  LPC  unless 
the  excitation  signal  is  augmented  with  formant  frequencies  similar  to  those  in  the  prediction  residual. 


The second reason the prediction residual contains considerable resonant frequencies is the
quantization of the filter coefficients, which tends to reduce the spectral peaks attained by an
all-pole filter (Fig. 3). This reduction is partly due to the clipping of LPC coefficients by the LPC
quantizer. Again, the differentials in the spectral peaks will appear as formant frequencies in the prediction
residual.  (Figure  3  is  based  on  the  coefficient  quantization  rule  for  the  DoD  standard  narrowband  LPC, 
but  all  other  parameter  quantization  rules  designed  for  the  2.4  kb/s  LPC  produce  similar  results.) 


When the resonant frequency components in the prediction residual are not present in the
excitation signal, the synthesized speech lacks clarity. Because the amplitude spectrum of the conventional
voiced excitation signal is flat (Fig. 2(c)) the synthesized formants are noticeably muddier than those in
the original speech. We have therefore developed a voiced excitation signal containing resonant
frequencies which improves the quality of the synthesized speech. Figure 2(d) shows that these resonant
frequencies are similar to those contained in the prediction residual.


Earlier  Experimentation  with  Amplitude  Shaping 

We observed resonant frequencies in the prediction residual as early as 1972 when we first
implemented a narrowband LPC based on the flow-form LPC implementation [8]. Unlike the block-form
LPC implementation [6,7], which is often employed because it requires fewer computational steps, the
flow-form LPC analysis generates the prediction residual as a by-product of the filter coefficient
estimation. We were surprised to find that the prediction residual contained significant resonant frequencies
(see Fig. 7 of Ref. 8), and was highly intelligible. We realized that narrowband LPC speech could best
be improved by introducing some of these resonant frequencies into the excitation signal.

Fig. 3 — Effect of LPC coefficient quantization on the amplitude response of the synthesis
filter (panel (b): amplitude response of synthesis filter). Quantization of LPC coefficients results
in a reduction of resonant peaks in the synthesis filter.

We  investigated  methods  of  shaping  the  amplitude  spectrum  of  the  conventional  LPC  excitation 
signal  in  1975.  An  experimental  3.6  kb/s  LPC  system  computed  eight  additional  LPC  coefficients  from 
the  prediction  residual  and  encoded  them  into  1.2  kb/s.  These  eight  coefficients  were  then  transmitted 
along with the conventional 2.4 kb/s LPC data. The sound quality of this 3.6 kb/s LPC was noticeably
better than that of the conventional 2.4 kb/s LPC: it was clearer, less muffled, and allowed better
speaker  recognition.  Since  we  are  limited  to  2.4  kb/s  in  the  current  investigation,  we  developed  a  way 
to  achieve  similar  improvements  in  speech  quality  without  transmitting  any  additional  data  derived  from 
the  prediction  residual.  This  is  a  theoretical  impossibility;  however,  an  approximate  shaping  of  the 
excitation  signal  is  possible  because  the  resonant  frequencies  in  the  prediction  residual  track  closely 
with  those  of  the  original  speech  (see  again  Fig.  2). 

Amplitude Spectrum Modification of the Voiced Excitation Signal

Since  we  are  concerned  here  only  with  the  resonant  frequencies  in  the  excitation  signal,  and  not 
with  the  antiresonances,  the  most  convenient  form  of  spectral  representation  is  the  all-pole  spectrum. 
Thus,  let  the  amplitude  spectrum  of  the  modified  excitation  signal  be  expressed  by 

$$A(z) = \frac{1}{1 - \sum_{n=1}^{N} y_n z^{-n}} \tag{6}$$

where $y_n$ is the $n$th prediction coefficient. Ideally $y_n$ would be obtained from the prediction residual.
As noted from Eq. (6), the amplitude spectrum of the modified excitation signal is similar in form to
the LPC synthesis filter $H(z)$:

$$H(z) = \frac{1}{1 - \sum_{n=1}^{N} a_n z^{-n}} \tag{4}$$

where $a_n$ is the $n$th prediction coefficient obtained from the speech.

While $a_n$ is available at the narrowband LPC receiver, $y_n$, which is needed for the amplitude
spectral modification, is not. We must therefore approximate $y_n$ from $a_n$ as best we can. To do this,
we exploit two observations.

The  first  is  that  the  predominant  resonant  frequencies  of  the  prediction  residual  track  closely  with 
those  of  the  original  speech,  as  illustrated  in  Fig.  2.  This  is  why  the  prediction  residual  is  so  intelligible. 
While  the  prediction  residual  has  extraneous  resonant  frequencies  not  found  in  the  original,  omission 
of  these  does  not  seem  to  have  a  significant  impact  on  the  output  speech.  However  the  resonant  peaks 
in  the  prediction  residual  are  nearly  equalized,  unlike  those  of  the  original  speech.  Thus  the  all-pole 
spectrum  of  the  prediction  residual  may  be  approximated  by  the  all-pole  spectrum  of  the  speech  with  a 
reduced  feedback  gain: 

$$A(z) = \frac{1}{1 - G \sum_{n=1}^{N} a_n z^{-n}}, \qquad G < 1 \tag{7}$$

where $a_n$ is the $n$th prediction coefficient of the speech available at the LPC synthesizer. The factor $G$
is related to the overall reduction of the pole moduli. Since the root loci of $A(z)$ do not lie along the radial
direction, there will be a slight but insignificant shift in the frequency of the resonant peaks.
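
The effect of the reduced feedback gain in Eq. (7) on the resonant peaks can be checked numerically. The following Python sketch is illustrative only: the two-pole resonator, the value G = 0.25, and the function name are assumptions chosen for the example, and the sketch evaluates the amplitude response only (it does not test the stability of the shaping filter).

```python
import cmath
import math


def all_pole_magnitude(a, gain, omega):
    """|1 / (1 - gain * sum_n a[n-1] * exp(-j*omega*n))| at radian frequency omega."""
    acc = sum(c * cmath.exp(-1j * omega * (n + 1)) for n, c in enumerate(a))
    return 1.0 / abs(1.0 - gain * acc)


# Hypothetical two-pole resonator: pole radius 0.95 at pi/4 rad/sample.
r, theta = 0.95, math.pi / 4
a = [2 * r * math.cos(theta), -r * r]

grid = [math.pi * i / 512 for i in range(513)]               # 0 .. pi rad/sample
peak_h = max(all_pole_magnitude(a, 1.0, w) for w in grid)    # synthesis filter H(z), Eq. (4)
peak_a = max(all_pole_magnitude(a, 0.25, w) for w in grid)   # excitation shaping, Eq. (7)
print(f"peak |H| = {peak_h:.2f}, peak |A| = {peak_a:.2f}")
```

With G well below 1 the shaping spectrum keeps a resonance near the original formant but with a much lower peak, which matches the "weak traces" of resonance the modified excitation is meant to carry.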

The second observation is that the residual formant peaks become smaller as the prediction
residual becomes more random. This occurs with front vowels, murmurs and nasals, where the speech
waveform may be well approximated by one or two exponentially decaying sinusoidal functions. For
these speech waveforms the efficiency of the linear prediction is fairly high, so that the residual RMS is
relatively small for a given speech RMS. Thus, it is natural to assume that the modulus reduction
factor is proportional to the ratio of the residual RMS to the speech RMS, namely

$$G = G' \sqrt{\prod_{n=1}^{N} \left( 1 - w_n^2 \right)} \tag{8}$$

where $G'$ is the proportionality constant yet to be determined, the factor under the radical is the ratio of
the residual RMS to the speech RMS, and $w_n$ is the $n$th reflection coefficient received by the
narrowband LPC. (Note that the current narrowband LPC transmits reflection coefficients as the synthesis
filter weights. The prediction coefficients are obtained through transformation of the reflection
coefficients at the receiver.)
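
The receiver-side computation described above can be sketched as follows. This is a minimal illustration, not the report's implementation: the function names and the example reflection coefficients are hypothetical, the step-up recursion assumes the sign convention in which the order-m predictor ends with a_m = w_m, and G' = 0.25 is simply the value suggested later in the text.

```python
import math


def reflection_to_prediction(refl):
    """Step-up recursion: reflection coefficients -> prediction coefficients a_1..a_N.

    Assumed convention: predictor s[i] ~ sum_n a_n * s[i-n], with a_m = w_m at order m.
    """
    a = []
    for m, wm in enumerate(refl, start=1):
        prev = a
        a = [prev[j] - wm * prev[m - 2 - j] for j in range(m - 1)] + [wm]
    return a


def modulus_reduction_factor(refl, g_prime=0.25):
    """Eq. (8): G = G' * sqrt(prod_n (1 - w_n**2))."""
    rms_ratio = math.sqrt(math.prod(1.0 - w * w for w in refl))
    return g_prime * rms_ratio


# Hypothetical received reflection coefficients (order 2 for brevity).
refl_example = [0.5, -0.5]
a_example = reflection_to_prediction(refl_example)
g_example = modulus_reduction_factor(refl_example)
shaping_coeffs = [g_example * c for c in a_example]   # feedback weights G * a_n of Eq. (7)
```

Because every quantity is derived from the reflection coefficients already transmitted, no additional data are needed at the receiver.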

The proportionality constant $G'$ in Eq. (8) is estimated by minimizing the mean-square difference
between $A(z)$ of Eq. (6) and $A(z)$ of Eq. (7). We chose the frequency-domain computational
approach because it enabled us to exclude the effect of frequency components below 150 Hz which
were not audible at the narrowband LPC output. We used approximately 1200 frames of male and
female voiced speech samples to obtain a preferred value for $G'$. Not surprisingly, Table 2 shows that
$G'$ varies from speaker to speaker. According to this table, a reasonable choice for $G'$ would be
somewhere around 0.25, even though, from listening to processed speech while varying $G'$ from 0 to 1.0, it
appears that there is a broad range of acceptable values for $G'$.
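
A frequency-domain fit of this kind might look like the sketch below. It is a hypothetical illustration only: the target spectrum is synthetic (generated from Eq. (7) with a known G' rather than measured from a real prediction residual), the error is taken on linear amplitude, a simple grid search stands in for whatever minimization the authors used, and all names and values are assumptions.

```python
import cmath
import math


def magnitude(coeffs, gain, freq_hz, fs=8000.0):
    """Amplitude of 1 / (1 - gain * sum_n coeffs[n-1] * z^-n) at freq_hz."""
    w = 2.0 * math.pi * freq_hz / fs
    acc = sum(c * cmath.exp(-1j * w * (n + 1)) for n, c in enumerate(coeffs))
    return 1.0 / abs(1.0 - gain * acc)


def fit_g_prime(target, a, rms_ratio, freqs, candidates):
    """Return the candidate G' that minimizes the mean-square amplitude difference."""
    def mse(gp):
        g = gp * rms_ratio                      # Eq. (8)
        return sum((magnitude(a, g, f) - t) ** 2 for f, t in zip(freqs, target)) / len(freqs)
    return min(candidates, key=mse)


a = [0.75, -0.5]                                 # hypothetical prediction coefficients
rms_ratio = 0.75                                 # hypothetical residual-to-speech RMS ratio
freqs = [150.0 + 50.0 * i for i in range(78)]    # 150 Hz .. 4000 Hz; lower band excluded
target = [magnitude(a, 0.3 * rms_ratio, f) for f in freqs]   # synthetic "residual" spectrum
candidates = [i / 100 for i in range(101)]       # G' from 0 to 1.0
best = fit_g_prime(target, a, rms_ratio, freqs, candidates)
```

In this synthetic case the search returns the G' used to build the target; with real residual spectra the minimum would be shallower, consistent with the broad range of acceptable values noted above.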

The excitation spectrum defined by Eq. (7) may be incorporated in the narrowband LPC in two
ways: one is a direct method in which the amplitude spectral components in the excitation signal model
in Eq. (1) are made equal to the amplitude spectrum of Eq. (7); the other is an indirect method in
which the amplitude spectral components in Eq. (1) are constants, but the amplitude spectrum is
modified by passing the flat-spectrum excitation signal through an all-pole filter whose transfer function
is described by Eq. (7). We tried both methods and noted virtually no difference in the sound quality.

Table 2— Statistics of Proportionality
Constant Used in Eq. (8)

Speakers    Mean Value and Standard Deviation (legible entries only; column assignment not recoverable)
Female      0.366, 0.325
Male        0.117, 0.101, 0.065, 0.062

Note: For each speaker, approximately 100 frames were used to generate both the mean value and
standard deviation.

Test  and  Evaluation 

We incorporated the amplitude spectral modification of the voiced excitation signal in NRL’s
programmable real-time narrowband voice processor and in another voice processor currently under
development. We used the Diagnostic Acceptability Measure (DAM) to evaluate the speech quality of
these two systems. Both tests yielded virtually identical results. A 5-point improvement was shown in
the overall DAM scores, indicating that the speech quality of our modified LPC is closer to that of the
9.6 kb/s APC than to the conventional 2.4 kb/s narrowband LPC (Fig. 4).

Though  we  did  not  expect  the  amplitude  spectrum  modification  of  the  voiced  excitation  signal  to 
noticeably  affect  consonant  intelligibility,  we  nevertheless  conducted  Diagnostic  Rhyme  Tests  (DRTs) 
to  ensure  that  it  did  not  hurt  the  speech  intelligibility.  The  DRT  scores  for  three  male  and  three 
female  speakers  in  a  quiet  environment  were  87  both  with  and  without  the  amplitude  spectrum 
modification. Likewise, the DRT scores for three male speakers in a shipboard environment were
virtually unchanged: 78 with modification and 77 without modification. These results confirm that our
amplitude  spectral  modification  of  the  voiced  excitation  signal  significantly  improves  the  quality  of  the 
narrowband  LPC  speech  without  affecting  the  intelligibility. 


PHASE  SPECTRUM  SHAPING  OF  THE  VOICED  EXCITATION  SIGNAL 

Before  there  was  a  convenient  way  to  generate  complex  signals  with  independently  controlled 
phases  it  was  thought  that  the  human  ear  was  phase  deaf.  Today  we  can  adjust  the  phase  spectrum  of  a 
complex  waveform  easily,  and  studies  have  found  that  the  phase  relationships  between  tones  do  have 
some  influence  on  the  perceived  sound  quality.  For  example,  every  experienced  organist  prefers  the 
sound of an organ having individual oscillators (such as Conn, Allen and Rodgers organs) over the
sound  of  an  organ  with  only  12  master  oscillators  that  regenerate  all  the  harmonically  related  tones 
(such  as  Baldwin  or  Hammond  organs).  Though  difficult  to  describe,  there  is  something  more  pleasing 
about complex waveforms with incoherent phases.

Fig. 4 — DAM scores for the 2.4 kb/s narrowband LPC. This
figure illustrates the degree of improvement in the speech quality
as a result of the amplitude spectral modification of the
voiced excitation signal in the 2.4 kb/s LPC. For purposes of
illustration, the DAM scores for the 9.6 kb/s APC voice processor
are also shown.


Similarly,  a  number  of  practitioners  in  the  speech  analysis  and  synthesis  fields  have  observed  that 
the  perceptual  quality  of  synthetic  speech  depends  to  some  extent  on  the  phase  spectrum  of  the  voiced 
excitation  signal  [9].  Some  have  even  observed  that  a  reduction  of  peakiness  in  the  voiced  excitation 
signal,  which  is  related  to  the  phase  spectrum,  results  in  a  reduction  of  buzziness  in  the  synthetic 
speech [10]. In any case, the phase spectrum of the voiced excitation signal does not affect the pitch
[11].

Ideally,  the  phase  spectrum  of  the  voiced  excitation  signal  should  be  the  phase  spectrum  of  the 
pitch-synchronously  windowed  prediction  residual  with  a  window  width  equal  to  the  pitch  period.  If 
both  amplitude  spectra  are  equal,  the  resulting  excitation  signal  is  equal  to  the  prediction  residual  of 
one  pitch  period— the  ideal  excitation  signal  for  a  pitch-excited  LPC  or  an  LPC  that  repeats  the  voiced 
excitation signal at the pitch rate. Actually, some researchers have suggested using the median
differential delay of the pitch-synchronously windowed prediction residual (defined as the first
derivative of the phase spectrum with respect to frequency) [12,13] to determine the preferred phase
spectrum of the excitation signal. The median delay is an approximately linearly ascending function of
frequency, with a total increment of delay of roughly 1.2 ms from 0 Hz to the upper cutoff frequency of
3.2  kHz.  The  resulting  sound  quality  is  reported  to  be  more  natural  than  when  a  constant  differential 
delay  of  zero  (i.e.,  an  impulse  train)  is  used.  As  it  turns  out,  the  stationary  part  of  the  differential 
delay  of  our  voiced  excitation  signal  is  quite  similar  to  the  median  delay  of  the  pitch-synchronously 
windowed  prediction  residual  mentioned  above.  We  use  a  time-dispersed  voiced  excitation  signal  for 
two  reasons:  (a)  to  improve  the  performance  in  tandem  with  continuously  variable  slope  delta  (CVSD) 
systems,  and  (b)  to  best  use  the  available  dynamic  range  of  the  (arithmetic)  processor  used. 




The  time-invariant  portion  of  the  phase  spectrum  discussed  above  fully  specifies  the  conventional 
voiced  excitation  signal.  The  phase  spectrum  of  our  voiced  excitation,  however,  has  an  additional 
time-variant portion to accommodate a small amount of waveform variation from one pitch cycle to the
next.  These  period-to-period  waveform  variations,  often  referred  to  as  pitch  jitter,  are  caused  in  part  by 
irregularities  in  vocal  cord  movement,  and  in  part  by  the  turbulent  air  flow  from  the  lungs  during  the 
glottis-open  period  of  each  cycle.  The  amount  of  jitter  varies  with  the  fundamental  pitch  frequency,  the 
age  of  the  speaker,  his  or  her  nervous  condition,  and  the  degree  of  muscular  elasticity. 

Without  an  appropriate  amount  of  pitch  jitter,  the  synthetic  speech  sounds  unnatural  in  several 
ways.  First,  it  sounds  flat  and  machinelike  because  the  waveform  is  too  similar  from  one  pitch  cycle  to 
the  next.  Second,  the  synthetic  speech  sounds  heavy  and  buzzy  because  of  a  lack  of  change,  or  flutter, 
particularly  in  the  higher  pitch  harmonics.  A  combination  of  these  characteristics  makes  the  synthetic 
speech  sound  edgy  and  tense,  though  most  people  are  only  subconsciously  aware  of  it. 

This  last  effect  deserves  special  attention  because  of  its  particularly  insidious  nature.  When  we 
look  at  the  structure  of  a  soothing,  mellifluous  voice  like  President  Reagan’s,  we  immediately  notice 
that  such  a  voice  lacks  the  strong,  regular  pitch  harmonics  so  prevalent  in  the  synthetic  LPC  speech. 
We  believe  this  is  due  to  the  presence  of  a  certain  amount  of  breath  air  during  the  glottis-open  period, 
which  introduces  flutter  in  the  high-frequency  pitch  harmonics.  On  the  other  hand,  strong,  regular 
pitch  harmonics  similar  to  those  of  the  LPC  synthesized  speech  are  characteristic  of  sharp,  clear  voices 
like  Paul  Harvey’s,  and  of  speakers  who  are  tense  or  angry.  This  is  probably  caused  by  a  stiffening  of 
the  vocal  cord  muscles. 

Figures  5  through  7  vividly  illustrate  how  the  speech  and  prediction  residual  waveforms  differ  in 
unusually mellow, normal, and tense voices for both male and female speakers. Note that the
periodicity of the prediction residual, particularly that of the high-passed prediction residual, is progressively
better  defined  as  the  tenseness  of  the  voice  increases.  In  very  tense  voices  the  prediction  residual  looks 
much  like  the  conventional  voiced  excitation  signal  used  in  the  narrowband  LPC  (see  Fig.  8).  We 
believe  this  is  one  of  the  reasons  LPC  speech  sounds  unnecessarily  tense  regardless  of  the  quality  of 
the  speaker’s  voice. 

All these observations lead us to the conclusion that a small amount of irregularity in the
narrowband LPC speech is highly desirable. A similar conclusion was reached by Makhoul et al. [14], who
introduced irregularity in LPC synthesized speech by using a mixed source in which the periodic pulse
train was low-pass filtered while the noise was high-pass filtered at the same cutoff frequency. The
cutoff frequency was variable and was estimated to be the highest frequency at which the speech
spectrum was considered periodic. This cutoff frequency was quantized into 2 or 3 bits and transmitted to
the  receiver.  The  frequency  quantization  step  was  as  coarse  as  500  Hz,  and  low-order  Butterworth 
filters  were  used.  According  to  the  authors,  the  above  mixed  excitation  source  appeared  to  reduce  two 
seemingly  different  types  of  buzziness:  the  first  was  the  quality  of  synthetic  voiced  fricatives;  the 
second  was  the  buzziness  of  sonorants,  associated  mainly  with  low-pitched  voices. 

Mixed  excitation  sources  are  not  new;  they  have  previously  been  applied  to  channel  vocoders 
[15,16]  and  to  the  formant  synthesizer  [17]  to  improve  voice  quality.  Our  improvement  to  the  LPC 
excitation  signal  also  uses  a  mixed  excitation  source.  In  our  approach,  the  mixed  excitation  source  is 
simply  a  special  case  of  the  excitation  signal  generator  described  in  Eq.  (1)  and  can  have  both  pitch- 
epoch  variations  and  period-to-period  waveform  variations.  Because  we  are  constrained  by  the  DoD 
interoperability requirements we cannot use any information not transmitted by the standard
narrowband LPC. While some flexibility is lost by not using this additional information, our mixed
excitation source is still much closer to the ideal excitation for the LPC analysis/synthesis system (i.e., the
prediction residual) than is the conventional excitation.

Fig. 5 — Unprocessed speech and prediction residual waveforms of mellow
voices. Note the randomness of the prediction residual, particularly the high-passed
prediction residual, and compare this waveform with the conventional narrowband LPC
voiced excitation signal shown in Fig. 1. Some amount of randomness in the excitation
signal is essential for the production of natural sounding speech. Note also the highly
oscillatory speech waveform characteristic of mellow voices. The prediction residual
waveforms illustrated in this figure (as well as those in Figs. 6 and 7) have been amplified
four times for clarity.

Fig. 6 — Unprocessed speech and prediction residual waveforms of normal voices. Note
that the periodicity of the prediction residual is better defined than in the preceding
figure, but less than for the tense voices in the following figure. Figure 8 illustrates that
our voiced excitation signal for the narrowband LPC has a similar amount of randomness.

Fig. 7 — Unprocessed speech and prediction residual waveforms of tense voices. Note
that the well-defined periodicity of the prediction residual, even the high-passed prediction
residual, is very similar to that of the conventional narrowband LPC voiced excitation
signal (Fig. 8). Note also the highly damped speech waveform which might easily be
mistaken for a seismic wave.

Fig. 8 — Synthesized speech and excitation signal waveforms for the narrowband LPC.
These waveforms are generated by the use of LPC parameters extracted from the normal
female speech waveform shown in Fig. 6. The absence of randomness in the conventional
voiced excitation signal is in part responsible for the tense and unnatural speech quality
of the narrowband LPC. (Compare the left column of this figure with Fig. 7.) The presence
of randomness in our voiced excitation signal (right column) adds naturalness to the
synthesized speech. Our voiced excitation signal is an approximation of the actual prediction
residual of the normal female voice shown in Fig. 6.

The phase spectrum $\phi(k)$ of our excitation signal as expressed by Eq. (1) consists of two parts:

$$\phi(k) = \phi_0(k) + \Delta\phi(k), \qquad k = 1, 2, \ldots, K, \tag{9}$$

where $\phi_0(k)$ and $\Delta\phi(k)$ are the $k$th stationary and random phase components respectively. The random
part of the phase spectrum is further divided into two parts:

$$\Delta\phi(k) = \Delta\phi_1(k) + \Delta\phi_2(k), \qquad k = 1, 2, \ldots, K, \tag{10}$$

where $\Delta\phi_1(k)$ and $\Delta\phi_2(k)$ are the random phases contributing to pitch-epoch jitter and
period-to-period waveform variations respectively. We discuss these phase spectral components in the following
sections.

Stationary  Part  of  the  Phase  Spectrum 

The  stationary  part  of  the  phase  spectrum  of  the  voiced  excitation  signal  is  important  because  it 
has  a  direct  bearing  on  the  peakedness  and  dispersiveness  of  the  excitation  signal.  For  example,  if  the 
phase  spectrum  is  a  linear  function  of  frequency,  or  the  differential  delay  is  zero,  all  the  frequency 
components  will  be  phase-aligned  and  will  produce  a  spike  or  impulse. 

The use of an impulse for the voiced excitation is undesirable for two reasons. First, a spiky
excitation signal produces a spiky narrowband LPC output which does not operate well in tandem with
high-rate voice processors that encode the difference of two consecutive speech samples, such as
continuously variable slope delta (CVSD) systems. Because CVSD cannot accurately follow the steep
changes in the input amplitude produced by the impulse excitation, the output speech is distorted.
Over the years, the narrowband LPC has improved its tandem performance with the CVSD. At one
time the DRT score for a 16 kb/s CVSD operating from the narrowband LPC output was 78 for three
male and three female voices; it is now 82. One of the major reasons for this improvement is the use
in the LPC of a time-dispersed voiced excitation signal in lieu of an impulse excitation.

Second,  a  spiky  excitation  signal  requires  a  greater  dynamic  range  in  the  LPC  signal  processor,  so 
the  output  amplitude  often  has  to  be  lowered  to  avoid  clipping.  We  can  reduce  the  required  dynamic 
range  by  as  much  as  10  dB  by  using  a  time-dispersed  voiced  excitation  signal  like  that  discussed  below. 

On  the  other  hand,  it  is  also  undesirable  for  the  voiced  excitation  signal  to  be  dispersed  over 
several  pitch  periods  because  the  LPC  synthesizer  is  a  dynamic  system  in  which  the  filter  coefficients 
are updated pitch synchronously. The problem is even more complicated because the current
narrowband LPC calibrates the speech level after the synthesis, with a constant power excitation at the
input. For proper superposition and calibration, the output waveform generated by each set of
excitation signal samples and filter coefficients must be stored independently. In general, a shorter excitation
signal  requires  less  data  storage  and  fewer  computations. 

In the past, a number of different approaches have been investigated in an effort to design a
family of signals with flat amplitude spectra and low peak amplitudes [9,18]. If the signal is expressed as a
Fourier series, like our excitation signal, the required phase spectrum is a quadratic function of
frequency [9]. Thus,

$$\phi_0(k) = (2\pi)\,\xi \left( \frac{k}{K} \right)^2, \qquad k = 0, 1, \ldots, K, \tag{11}$$

where $\phi_0(k)$ is the $k$th stationary phase component defined in Eq. (1), $K$ is the number of spectral
components defined in Eq. (2), and the quantity $\xi$ is an integer: the larger the $\xi$, the greater
the dispersion of the excitation signal. The differential delay, as obtained from Eq. (11), is

$$D_0(k) = \frac{\Delta\phi_0(k)}{\Delta\omega} = \frac{\phi_0(k) - \phi_0(k-1)}{\Delta\omega} \approx \frac{(2\pi)(2\xi)}{K(\Delta\omega)} \left( \frac{k}{K} \right), \tag{12}$$

in which $\Delta\omega$ is a uniform frequency spacing between two adjacent spectral components. In our
narrowband LPC, $K(\Delta\omega)$ is $(2\pi)4000$ rad/s. Thus, Eq. (12) may be written as

$$D_0(k) = 0.5\,\xi \left( \frac{k}{K} \right) \ \text{ms}. \tag{13}$$

Equation (13) states that if the phase angle is a multiple of $2\pi$ rad at 4000 Hz, the differential delay at
the same frequency is a multiple of 0.5 ms.
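
The step from the finite difference of Eq. (12) to the closed form of Eq. (13) can be checked numerically. The sketch below is illustrative: the number of spectral components K = 64 is a hypothetical value (the report defines K in Eq. (2), which is not reproduced here), the dispersion integer of Eq. (11) is called `xi` in the code, and the function names are assumptions.

```python
import math

BAND_EDGE_HZ = 4000.0  # K * delta_omega = 2*pi*4000 rad/s, as stated in the text


def stationary_phase(k, K, xi):
    """Eq. (11): phi_0(k) = 2*pi*xi*(k/K)**2."""
    return 2.0 * math.pi * xi * (k / K) ** 2


def delay_finite_difference(k, K, xi):
    """Eq. (12), middle form: [phi_0(k) - phi_0(k-1)] / delta_omega, in seconds."""
    delta_omega = 2.0 * math.pi * BAND_EDGE_HZ / K
    return (stationary_phase(k, K, xi) - stationary_phase(k - 1, K, xi)) / delta_omega


def delay_closed_form(k, K, xi):
    """Eq. (13): 0.5 * xi * (k/K) milliseconds, returned here in seconds."""
    return 0.5e-3 * xi * k / K


K = 64  # hypothetical number of spectral components
for xi in (3, 4, 5, 6):
    closed_ms = delay_closed_form(K, K, xi) * 1e3
    exact_ms = delay_finite_difference(K, K, xi) * 1e3
    print(f"xi={xi}: closed form {closed_ms:.3f} ms, finite difference {exact_ms:.3f} ms")
```

The two forms differ only by the half-step offset of the finite difference, which shrinks as K grows.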


For purposes of illustration, we generated four different voiced excitation signals using $\xi = 3$, 4,
5, and 6 in Eqs. (11) and (13). Table 3 lists the spectral and temporal characteristics of these signals.
In Example 1 ($\xi = 3$) the differential delay increases linearly from 0 ms at 0 Hz to 1.5 ms at 4000 Hz.
Table 4 shows the excitation signal samples which are dispersed over 25 sampling time intervals. The
peak amplitude reduction factor, defined as the maximum signal magnitude when the signal is
normalized to have a unity power, is 8.98 dB. This is an impressive figure since the peak amplitude reduction
factor realized by the 40-sample voiced excitation signal currently used by the DoD narrowband LPC is
only 9.18 dB. In the second example ($\xi = 4$), the differential delay at 4000 Hz is increased to 2 ms,
and the excitation signal samples are dispersed over 31 sampling time intervals. The resulting peak
amplitude reduction factor is increased to 9.51 dB, and so on.

For our excitation signal we set $\xi = 3$ in Eqs. (11) and (13) (Example 1) because this yields a
good peak amplitude reduction factor for the duration of the excitation signal. To verify that this
25-sample excitation signal can reproduce the originally specified frequency spectra characteristics, we
computed both the amplitude and the phase spectra. (We feared that integerization and truncation of
samples might have produced some spectral error.) Figure 9 shows that the computed spectra are virtually
identical to the originally specified spectra.
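
A time-dispersed excitation of this kind can be sketched as a flat-amplitude harmonic sum with the quadratic phase of Eq. (11). The code below is an illustration under assumptions, not the report's generator: Eq. (1) is not reproduced here, so the harmonic-sum form, the sign of the phase term, the value K = 64, and the function names are all hypothetical, and the resulting peak values are not expected to match Table 3 exactly.

```python
import math


def voiced_excitation(K, xi):
    """One period (2K samples at 8 kHz) of a flat-spectrum harmonic sum with quadratic phase."""
    signal = []
    for n in range(2 * K):
        total = 0.0
        for k in range(1, K + 1):
            phase = 2.0 * math.pi * xi * (k / K) ** 2        # Eq. (11)
            # k-th component lies at 4000*k/K Hz; the minus sign (assumed) makes
            # a positive phase slope act as a delay.
            total += math.cos(math.pi * k * n / K - phase)
        signal.append(total)
    energy = sum(x * x for x in signal)
    return [x / math.sqrt(energy) for x in signal]           # normalize: sum e^2(n) = 1


def peak_db(signal):
    """Maximum magnitude of the unit-energy signal, in dB."""
    return 20.0 * math.log10(max(abs(x) for x in signal))


K = 64  # hypothetical number of spectral components
impulse_like = voiced_excitation(K, 0)   # zero differential delay: all components aligned
dispersed = voiced_excitation(K, 3)      # the choice made in the text (Example 1)
print(f"xi=0: {peak_db(impulse_like):.2f} dB, xi=3: {peak_db(dispersed):.2f} dB")
```

Both signals have the same flat amplitude spectrum and the same energy; only the phase differs, which is why the dispersed version needs less dynamic range.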


Table 3— Characteristics of Stationary Part of Voiced Excitation Signals

Example   Amplitude   Phase Shift(b)   Diff. Delay(c)   Absolute Maximum Amplitude    Dispersion Width(d)
          Spectrum    @ 4000 Hz        @ 4000 Hz        When Σ e²(n) = 1              (No. of Samples)
                      (2π)ξ (rad)      0.5ξ (ms)        (linear)       (dB)
1(a)      Flat        3(2π)            1.5              (illegible)    -8.98          25
2         Flat        4(2π)            2.0              (illegible)    -9.51          31
3         Flat        5(2π)            2.5              (illegible)    -9.91          35
4         Flat        6(2π)            3.0              0.2835         -10.95         41

(a) Our choice.
(b) The phase spectrum is a quadratic function of frequency.
(c) The differential delay is a linear function of frequency.
(d) For comparison purposes, the dispersion width is arbitrarily defined as the time interval in which every sample has a
magnitude > 1/256 when the signal amplitude has been normalized to have a unity power.

[Figure 9 panels: (a) time samples (25 samples); (b) amplitude spectrum vs. frequency (kHz); (c) differential delay vs. frequency (kHz)]

Fig. 9 — Our chosen stationary voiced excitation signal: time samples, computed amplitude spectrum,
and differential delay. This is Example 1 in Tables 3 and 4 and is obtained by letting $\xi = 3$ in Eq. (11)
or Eq. (13).

It  is  interesting  to  note  that  the  delay  shown  in  Fig.  9(c)  is  similar  to  the  median  delay  computed 
from  the  actual  prediction  residual  by  Atal  and  David  [13].  The  median  delay  also  increases  nearly 
linearly  with  the  increase  in  frequency.  The  total  delay  increment  from  0  Hz  to  the  highest  frequency 
is  approximately  1.2  ms,  which  is  close  to  that  shown  in  Fig.  9(c). 

Random Part of the Phase Spectrum

As  stated  previously,  there  are  two  types  of  randomness  present  in  the  natural  voiced  speech 
waveform.  One  is  pitch-epoch  variation,  or  jitter,  caused  by  irregularities  in  vocal  cord  movement;  the 
other is period-to-period waveform variation caused by the turbulent air flow from the lungs. To
incorporate these variations in the excitation signal we need two different kinds of random spectral
components as discussed below.

Pitch-Epoch  Variations 

The magnitude of pitch-epoch variations is not large: the average shift is reportedly somewhere
between 10 and 60 μs for adult male speakers [19]. The presence of this small amount of pitch
variation is nevertheless essential to make synthesized speech sound more natural. Because the pitch period
as transmitted by the narrowband LPC is merely the average pitch period updated at a fixed frame rate
(approximately two pitch periods for an average male speaker, and four pitch periods for an average
female speaker), it does not contain any information related to pitch-epoch variation. Even if the pitch
period were updated several times per frame, it still would not reflect the actual pitch-epoch variation
because the pitch tracker has too much inertia to be influenced by such small changes. Moreover, the
pitch period quantization, where the minimum pitch period resolution is one sampling time interval, or
125 μs, is far too coarse to capture pitch-epoch variations as small as 10 to 60 μs. In short, pitch-epoch
variation in the narrowband LPC must be artificially introduced at the receiver.

In our voiced excitation signal, the pitch epoch is readily altered by allowing an additional linear
phase in the phase spectrum as expressed by Eq. (1). The gradient of the linear phase is randomly
perturbed from one pitch period to the next. As an example, if the phase changes linearly from 0 rad at
0 Hz to 1 rad at 4 kHz, the resulting differential delay of the time waveform is 1/(8000π) second, or
39.789 μs. A smaller phase shift gives rise to a proportionally smaller shift in pitch epoch. We found a
maximum jitter of 10 μs to be satisfactory. Thus the phase shift at 4 kHz is a maximum of 1/4 rad and
is computed by


$$\Delta\phi_1(k) = \frac{m}{4}\,\frac{k}{K}\ \text{rad}, \qquad k = 1, 2, \ldots, K \tag{14}$$


where Δφ₁(k) is the random part of the phase spectrum contributing to pitch-epoch variations, k is the
frequency index, K is the total number of frequency components, and m is a uniformly distributed
random number between -1 and 1 which changes at each pitch epoch.
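
The delay arithmetic above can be checked with a short sketch. It assumes, as the text states, a linear phase that rises from 0 rad at 0 Hz to at most 1/4 rad at the 4 kHz band edge, so that the random phase term has the form (m/4)(k/K). The function names and the choice K = 40 are illustrative only and do not come from the report.

```python
import math
import random

def jitter_phase(K, m):
    """Linear random phase: 0 rad at DC rising to m/4 rad at the band edge (k = K)."""
    return [(m / 4.0) * (k / K) for k in range(1, K + 1)]

def epoch_shift_seconds(phase_at_band_edge, band_edge_hz=4000.0):
    """Time shift implied by a linear phase that reaches the given value at band_edge_hz."""
    return phase_at_band_edge / (2.0 * math.pi * band_edge_hz)

rng = random.Random(0)
K = 40                        # hypothetical number of frequency components
m = rng.uniform(-1.0, 1.0)    # redrawn at every pitch epoch
phases = jitter_phase(K, m)
shift_us = epoch_shift_seconds(phases[-1]) * 1e6   # pitch-epoch shift in microseconds
```

With a 1 rad phase at 4 kHz the helper returns 1/(8000π) s, and with 1/4 rad it returns roughly one quarter of that, which is how the 10 μs jitter ceiling follows from the 1/4 rad limit.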

It is worth noting that even under the most ideal operating conditions (such as noise-free speech
and error-free transmission) the narrowband LPC generates a considerable amount of pitch irregularity,
or flutter, in the synthesized speech. This is primarily because the LPC analysis window is not placed
in perfect synchrony with the pitch cycle. This effect is further aggravated by the parameter
quantization, which tends to cause the synthesized speech waveform to vary even when the input is well
sustained. Since the narrowband LPC updates the speech parameters once every frame, the frequency of
the flutter is fairly low, and our ears are rather sensitive to it. Therefore, the pitch-epoch jitter must
not reinforce the already audible low-frequency flutter. (Note that flutter of this kind would not exist
in a speech synthesis system where the speech data are defined at irregular and sparsely spaced time
intervals. However, in this case the magnitude of the minimum pitch-epoch jitter would be even
greater than that of the narrowband LPC.)

Period-to-Period Waveform Variations

The period-to-period waveform variations caused by breath air are very complex. On the one
hand they are random because the air coming from the lungs is turbulent. On the other hand they are
pitch-modulated because the air passes through the glottis as it opens and closes at the pitch rate. The
period-to-period waveform variations in the prediction residual (the ideal excitation signal) are
disproportionately strong in the high-frequency regions because the LPC analysis filter boosts the treble
to flatten the spectral envelope of the speech, but not that of the breath noise. Figures 5 through 7 show
that the amount of period-to-period waveform variation in the prediction residual differs substantially
from speaker to speaker. In addition, evidence indicates that the amount of waveform variation
depends on the speech sound; for example, there is more randomness in back vowels than in front
vowels.


Period-to-period waveform variations are caused by a multitude of factors that cannot be emulated
by a simple mixed excitation source, nor by our general form of the mixed excitation source, when
relevant information is not available at the receiver. Because a many-to-one transformation exists
between random noise and its perception by the human ear, the nature of any artificially introduced
randomness in the voiced excitation signal need not be exactly identical to that of the prediction
residual. For example, unvoiced sounds from the telephone are severely distorted, yet we can still
identify them. Similarly, the spectral distribution of a fricative sound varies widely from speaker to
speaker [20], but this does not cause any misunderstanding. According to a recent experiment at NRL, the
intelligibility of the narrowband LPC speech is virtually unaffected even when the set of LPC
coefficients from unvoiced speech is quantized very coarsely into an eight-bit quantity (i.e., one of only
256 possible combinations).

We listened to a large number of speech samples processed by our real-time narrowband LPC as
we varied the nature of the random components in the voiced excitation signal. While there seemed to
be a wide range of acceptable characteristics, we noted that the overall intensity and the frequency
distribution of the random components appeared to be more significant than other parameters. The
overall intensity is important because the speech quality suffers if it is either too low or too high.
The frequency distribution characteristics are also important because the speech sounds warbly if there
is too much low-frequency jitter. Note that these are the only two parameters used by the narrowband
LPC to synthesize unvoiced speech.

Unfortunately we can neither extract nor transmit these two parameters at the LPC transmitter
because the resulting LPC would not be compatible with the standard DoD format. Therefore we
would like to extract average values for these two parameters from the actual prediction residual so that
we may use them as constants in the LPC receiver. This analysis is by no means straightforward; the
selection of the proper prediction residual samples and the choice of the analysis method are both
critical.


The prediction residual samples must be selected carefully because period-to-period waveform
variations in the prediction residual are caused not only by breath noise and the instability of the
excitation source (i.e., the glottis), but also by the changes in the vocal tract during speech transitions.
Since we would like to exclude the effects of the speech transitions in the estimated parameters, we must
select prediction residual samples from voiced frames where the LPC coefficients (i.e., the vocal tract
filtering characteristics) do not vary significantly from one frame to the next. In other words, we must
select the prediction residuals for analysis from sustained vowels.


Once the residual samples are selected, the choice of the analysis method is critical for obtaining
reliable analysis results. The most direct way of estimating the intensity and frequency distribution
parameters is through a variance analysis of the phase spectra derived from the prediction residual using
a pitch-synchronous analysis window. However, we find this approach prohibitively difficult and
risky, since even visual inspection cannot reliably determine the pitch epoch from a highly noise-like
prediction residual (for example, see Fig. 5). The phase spectrum is sensitive to the location of the
window with respect to the waveform under analysis, and frequent window placement errors will
degrade the estimated parameters beyond any usefulness. Since we are basically interested in the gross
characteristics of the frequency dependency and the overall intensity, rather than their detailed
frame-by-frame characteristics, we choose to use an alternate method of analysis.


This alternate method involves the spectral analysis of the pitch-filtered prediction residual
defined by

$$r'(i) = r(i) - \beta\, r(i - T) \tag{15}$$

where r(i) is a prediction residual sample, r'(i) is a pitch-filtered prediction residual sample, T is the
pitch period, and β is a first-order prediction coefficient of r(i) T samples apart. As usual, β is
obtained by minimizing the mean-square value of the right-hand member of Eq. (15). Thus,


$$\beta = \frac{\sum_i r(i)\, r(i - T)}{\sum_i r^2(i - T)} \tag{16}$$


Since we select only stationary prediction residuals for the analysis, β may be expressed by

$$\beta = \frac{\sum_i r(i)\, r(i - T)}{\sqrt{\sum_i r^2(i)\, \sum_i r^2(i - T)}} \tag{17}$$


where the magnitude is bounded between -1 and 1. Equation (15) represents the input-output
relationship of a notch filter which suppresses harmonically related frequencies (in this case, the
fundamental pitch frequency and its harmonics). The quantity β is related to the notch filter bandwidth
and is dependent on the randomness of the input. For example, in the absence of randomness, as in the
conventional voiced excitation signal, β is unity. For actual prediction residuals from steady vowels, β
lies somewhere between 0.7 and 0.9.
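
The pitch filter and its coefficient can be illustrated with a minimal sketch. It implements the least-squares coefficient of a one-tap predictor T samples back and the corresponding filtered residual; the synthetic pulse-train input, the noise level, and all function names are assumptions chosen for illustration, not data from the report.

```python
import random

def pitch_prediction_coefficient(r, T):
    """Least-squares beta for predicting r[i] from r[i - T]."""
    num = sum(r[i] * r[i - T] for i in range(T, len(r)))
    den = sum(r[i - T] ** 2 for i in range(T, len(r)))
    return num / den if den > 0.0 else 0.0

def pitch_filter(r, T, beta):
    """r'[i] = r[i] - beta * r[i - T]; the first T samples are passed through unchanged."""
    return [r[i] - (beta * r[i - T] if i >= T else 0.0) for i in range(len(r))]

rng = random.Random(0)
T = 4                                                        # hypothetical pitch period in samples
periodic = [1.0 if i % T == 0 else 0.0 for i in range(1000)]  # perfectly periodic "residual"
noisy = [s + rng.gauss(0.0, 0.25) for s in periodic]          # same pulses plus random variation

beta_clean = pitch_prediction_coefficient(periodic, T)
beta_noisy = pitch_prediction_coefficient(noisy, T)
filtered = pitch_filter(periodic, T, beta_clean)
```

For the perfectly periodic input the coefficient is unity and the filtered output vanishes after the first period; adding random variation pulls the coefficient below one, which is the behavior the text describes for real residuals.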

With a steady vowel as the input, the pitch-filtered prediction residual is mainly period-to-period
waveform variations of the prediction residual. Thus, the spectral analysis of the pitch-filtered
prediction residual indicates both the nature of the frequency dependency and the overall intensity of the
random parts of the prediction residual. Figure 10 shows the amplitude spectra of pitch-filtered prediction
residuals generated from the three types of female voice waveforms previously illustrated in Figs. 5
through 7. For reference, Fig. 10 also shows the amplitude spectra of the corresponding prediction
residuals. Note that the irregular spectral pattern of the prediction residual (mainly in the
high-frequency region) may or may not be related to the presence of period-to-period waveform variations.
This irregularity may also be due to the relatively constant absorption of selected frequencies by the
vocal tract.


Fig. 10 - Amplitude spectra of prediction residuals and pitch-filtered prediction residuals from the three
female voices shown in Figs. 5 through 7. As noted, the amplitude spectrum of the pitch-filtered prediction
residual generally increases with frequency.


The spectral distribution of the pitch-filtered prediction residual is significant because it represents
the spectrum of the period-to-period waveform variations in the prediction residual. We introduce
random components in the voiced excitation signal such that the amplitude spectrum of the pitch-filtered
excitation signal has a spectral distribution similar to that of normal voices as shown in Fig. 10. This
figure, as well as similar plots of other voices, shows that the amplitude spectrum of the pitch-filtered
prediction residual is an approximately linear function of frequency, and the pitch prediction coefficient
β is approximately 0.85. Thus the random part of the phase spectrum Δφ₂(k) is obtained numerically
by using Eqs. (1), (15), and (17):

$$\Delta\phi_2(k) = \ldots\ \text{rad} \tag{18}$$




where  tr(k)  is  a  uniformly  distributed  random  variable  between  —1  and  1,  k  is  the  frequency  index, 
and  K  is  the  total  number  of  components  within  the  0  to  4  kHz  passband.  Figure  11,  which  is  similar 
to  Fig.  10,  compares  the  conventional  voiced  excitation  signal  and  our  modified  voiced  excitation  signal. 
Note  that  our  pitch-filtered  excitation  signal  has  characteristics  more  similar  to  those  of  the  prediction 
residual  of  the  normal  voice.  (The  time  samples  of  both  excitation  signals  are  shown  in  Fig.  8.) 


Fig. 11 - Amplitude spectra of the voiced excitation signal and the pitch-filtered voiced excitation signal for the
conventional excitation (upper illustrations) and our modified excitation (lower illustrations). Both are derived
from LPC parameters generated by using the speech waveform of the normal female voice shown in Fig. 6. (The
prediction residual spectrum and pitch-filtered residual spectrum of this voice are shown in Fig. 10.) The
conventional voiced excitation signal has a small amount of randomness because we carefully introduced the actual
LPC parameter quantization and interpolation effects in the excitation signal, but the amount of randomness is
negligible. On the other hand, our voiced excitation signal has randomness in which the frequency dependency and
magnitude (in terms of the β value) are similar to those of the pitch-filtered prediction residual of the actual
speech as shown in Fig. 10.


Test  and  Evaluation 

When our voiced excitation signal is used in the narrowband LPC, one can readily hear that the
output speech has a quality of breathiness not unlike that of the unprocessed speech. The output
speech sounds much livelier, and the buzzy, twangy qualities often present in the conventional
narrowband LPC output are greatly reduced. DAM tests were conducted to ascertain the degree of quality
improvement achieved. The test results show a 4.7-point improvement for male speakers (from 48.6 to
54.3) and a 5.0-point improvement for female speakers (from 44.7 to 49.7). The scores for the
modified LPC compare favorably with those for a 9.6 kb/s voice processor (54.8 for males and 53.5 for
females). A DRT was also conducted to ensure that the phase spectral modification did not produce
such strong improvements in speech quality at the expense of speech intelligibility. As expected, the
DRT score of 85.8 for the modified LPC was only slightly better than the score of 85.3 for the
conventional LPC.

MODIFIED  UNVOICED  EXCITATION  SIGNAL 

In the past, the unvoiced excitation signal has not received as much attention as the voiced
excitation signal. The excitation signal traditionally used for generating all unvoiced sounds is simple
random noise; no distinction is made between fricative sounds (/h/, /s/, /sh/, /f/, /th/) and burst, or stop,
sounds (/p/, /t/, /k/). Usually the excitation signal is generated by randomly picking numbers from a
table containing uniformly distributed random numbers; a small table containing about 256 numbers is
adequate.

In our modified excitation signal generator both the voiced and unvoiced excitation signals are
synthesized from Eq. (1). They differ only in their phase spectra: for the unvoiced excitation the
phase spectral components are random variables, and may be distributed uniformly between -π and π
radians. According to the Central Limit Theorem our unvoiced excitation signal will actually tend to
have a Gaussian distribution because each sample is expressed by a sum of random variables (Eq. (1)).
Figure 12 illustrates the probability density function of our excitation signal computed from 1000
samples having uniformly distributed phase spectral components. Figure 13 shows that the probability
density function of our unvoiced excitation is approximately Gaussian, and it is actually a better
approximation of the probability density function of the prediction residual of voiceless fricative speech
than is the uniformly distributed unvoiced excitation signal used in the conventional narrowband LPC.
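
The Central Limit Theorem argument can be illustrated numerically. The sketch below assumes that Eq. (1) is a sum of equal-amplitude cosines at harmonically spaced frequencies, which is an assumption made here for illustration since Eq. (1) is not restated in this section; the frame length of 100 samples and the function names are likewise hypothetical. It compares the normalized fourth moment (kurtosis) of the random-phase signal with that of uniformly distributed noise. A Gaussian variable has kurtosis 3 and a uniform variable has kurtosis 1.8.

```python
import math
import random

def unvoiced_frame(frame_len, rng):
    """One frame built as a sum of unit-amplitude cosines with independent random phases."""
    K = frame_len // 2 - 1                      # components strictly below the Nyquist frequency
    phases = [rng.uniform(-math.pi, math.pi) for _ in range(K)]
    return [
        sum(math.cos(2.0 * math.pi * k * i / frame_len + phases[k - 1])
            for k in range(1, K + 1))
        for i in range(frame_len)
    ]

def kurtosis(x):
    """Normalized fourth central moment: about 3 for Gaussian data, 1.8 for uniform data."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / (m2 * m2)

rng = random.Random(0)
# Ten frames of 100 samples, with fresh random phases in every frame (1000 samples in total).
random_phase = [s for _ in range(10) for s in unvoiced_frame(100, rng)]
uniform_noise = [rng.uniform(-1.0, 1.0) for _ in range(1000)]

k_phase = kurtosis(random_phase)
k_uniform = kurtosis(uniform_noise)
```

Under these assumptions the random-phase signal should show a kurtosis close to the Gaussian value, while the table-lookup style uniform noise stays near 1.8.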


Fig. 12 - Characteristics of our unvoiced excitation signal used to generate the fricative sound /s/:
(a) time samples (1000 samples); (b) probability density function of (a), plotted against normalized
amplitude. The normalized amplitude is the excitation signal amplitude divided by its RMS value.

Despite its inaccurate probability density function, the conventional unvoiced excitation signal is
adequate for generating fricative sounds. This signal, the resulting synthesized speech waveforms, and
the prediction residuals from such speech waveforms are basically stationary noise. Thus the ear tends
to accept them as fricative sounds. However, this excitation is not satisfactory for generating burst
sounds. The onsets of these sounds generate large spikes in the prediction residuals (Fig. 14), but the
excitation signal conventionally used to synthesize them is still stationary noise. As a result, CAT is
often heard as HAT, and TICK may sound like THICK or SICK. To improve the reproduction of
unvoiced bursts, we have modified the unvoiced excitation signal to include a way of generating such
spikes.

This modified excitation signal is actually a superposition of two signals: one is similar to the
conventional unvoiced excitation signal; the other is a train of randomly spaced pulses. The amount of
pulse energy is proportional to the abruptness of the unvoiced speech as measured by the speech
root-mean-square (RMS) ratio of two adjacent unvoiced frames. In the remaining part of this section we
examine prediction residuals from both fricatives and abrupt unvoiced samples and compute the speech
RMS ratios from various unvoiced onsets. We also present evidence demonstrating that the modified
unvoiced excitation signal enhances the reproduction of unvoiced stops in the narrowband LPC.

Fricative  Sounds  and  Their  Prediction  Residuals 

In  speech,  fricative  noise  is  generated  by  a  turbulence  in  the  airflow  caused  by  a  constriction 
somewhere  in  the  vocal  tract.  The  place  of  the  constriction  determines  the  frequency  spectrum  and  the 
intensity  of  the  sound.  Figure  13  shows  the  amplitude  distribution  of  the  prediction  residual  processed 
from  1000  samples  of  /s/  at  the  trailing  end  of  COURSE  (female  speaker).  The  amplitude  distributions 
of  the  prediction  residuals  for  other  fricative  sounds  are  similar  to  the  example  shown  [20,21].  These 
distributions  may  be  approximated  by  the  Gaussian  distribution  function,  and  as  such,  the  conventional 
excitation  signal  is  adequate  for  producing  these  fricatives  within  the  4  kHz  passband. 

Unvoiced  Plosives  and  Their  Prediction  Residuals 

A plosive burst is a sequence of events that involves the integration of both spectral and temporal
cues. First, a rapid closure is effected at some point in the oral cavity and pressure is built up behind it.
When the closure is released, a burst of energy having a broad bandwidth and short duration is
generated. Unvoiced bursts (/p/, /t/, /k/) are louder and longer than voiced bursts (/b/, /d/, /g/) since
more pressure is developed before release [21].

Because of this sudden burst of energy, the amplitude of the prediction residual of an unvoiced
burst is particularly large at the onset of the sound. Therefore the accurate synthesis of unvoiced
plosives requires an excitation signal having one or more sharp spikes at the onset. However, spikes
should not be present at the onsets of fricative sounds. The implementation of such an excitation
signal therefore requires a way of measuring the abruptness of the speech to discriminate between the
burst onsets of stops and the relatively gentle onsets of fricatives. Because data rate restrictions
prohibit the transmission of any additional information, this measure must be derived from the LPC
parameters available at the receiver.

Measure  of  Abruptness 

The abruptness of the speech is related to the amount of change in the speech energy over a short
period of time. Thus the ratio of the speech RMS values from two consecutive frames should indicate
the degree of abruptness. To test this hypothesis, we randomly selected words containing abrupt and
nonabrupt unvoiced consonants and computed the speech RMS ratios at the consonant onsets. The test
words were excerpted from casually spoken sentences, so they were not articulated any more carefully
than would be expected in normal conversational speech. The computed speech RMS ratios, listed in
Table 5, are consistently larger for the stops and smaller for the fricatives. This is also true for the two
words (TOOK and TOWN) contaminated by helicopter carrier noise.


Table 5 - Speech RMS Ratios From Two Consecutive Unvoiced Frames

Group                          Test word      Ratio of speech RMS values from
                                              two consecutive unvoiced frames*
                               (underlining, not reproduced here, marked
                               where the RMS ratio is computed)

Abrupt unvoiced plosives       out            14
                               stop           17
                               to             32
                               blunt          34
                               can            19
                               take           20
                               course         25
                               took†          26
                               town†          19
                               at your        22
                               ripe           11

Nonabrupt unvoiced fricatives  stop            2
                               gets            5
                               …               …
                               this            3
                               sharp           2
                               fred            2

*RMS values less than 4 are set to 4 to reduce the effect of noise interference (see the text).
†With shipboard background noise.


In general the presence of background noise decreases the magnitude of the speech RMS ratio, so
unvoiced stops tend to sound like fricatives unless the noise interference is reduced somehow. For this
reason we recommend the use of a noise-cancellation microphone and noise-suppression preprocessing,
such as the spectral subtraction method [1], on noisy platforms. Table 6 lists the cumulative probability
functions of background noise RMS values from eight different platforms by using both a
noise-cancellation microphone and noise-suppression preprocessing. If the noise floor is less than 10 dB
when the speech amplitude is quantized to 12 bits per sample, the effect of the noise floor on the RMS
ratio is not significant. However, we set the minimum RMS at 4 in order to reduce the contrast
between noise-free and noisy cases when computing the RMS ratio. The values in Table 5 were
obtained on this basis.
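
A minimal sketch of this abruptness measure follows. It assumes the ratio is taken as the current frame's RMS over the previous frame's RMS, with both values floored at 4 before dividing; the direction of the ratio, the function name, and the constant name are assumptions made for illustration.

```python
RMS_FLOOR = 4.0   # minimum RMS value used when forming the ratio

def speech_rms_ratio(prev_rms, curr_rms, floor=RMS_FLOOR):
    """Ratio of the RMS values of two consecutive unvoiced frames, with both values floored."""
    prev = max(prev_rms, floor)
    curr = max(curr_rms, floor)
    return curr / prev

# A burst that follows near silence gives a large ratio; a gentle onset gives a small one.
burst_ratio = speech_rms_ratio(prev_rms=0.0, curr_rms=100.0)
gentle_ratio = speech_rms_ratio(prev_rms=30.0, curr_rms=60.0)
```

Flooring both frames keeps a nearly silent preceding frame from producing an unbounded ratio, and it also narrows the gap between noise-free and noisy recordings of the same onset.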

Modified  Unvoiced  Excitation  Signal  Model 

Our  objective  here  is  to  improve  the  sound  quality  of  unvoiced  stops  in  the  narrowband  LPC  by 
using  only  the  information  available  at  the  receiver.  We  concluded  that  the  best  way  to  accomplish 
this  was  to  modify  the  excitation  signal  by  introducing  sharp  spikes  as  discussed  above.  In  essence  our 
modified  unvoiced  excitation  signal  is  the  conventional  unvoiced  excitation  signal  with  a  superimposed 
train  of  randomly  spaced  pulses.  Thus,  it  may  be  expressed  by 

$$e(i) = n(i) + R\,p(i) \tag{19}$$


Table 6 - Cumulative Probabilities of Background Noise Amplitudes
Observed at Eight Different Military Platforms

Test conditions: quiet; airborne command post noise; shipboard noises; office noise; E3A noise;
helicopter carrier noise; P3C turboprop noise.

Cumulative probabilities by narrowband LPC amplitude parameter* (the column headings and the
assignment of rows to test conditions are not recoverable from the source):

0.99  0.99  1.00  1.
0.99  0.99  1.00  1.00
0.94  0.96  0.98  0.99
0.59  0.83  0.91  0.95
0.02  0.02  0.07  0.30
0.19  0.71  0.90  0.95
0.95  0.98  0.99  0.99  0.99  0.99
0.80  0.90  0.96  0.98
0.02  0.05  0.16  0.49  0.77  0.81  0.94
0.02  0.03  0.14  0.39  0.66  0.82  0.89
0.91  0.99
0.96  0.99

The normal speaking level is approximately 110 dB sound pressure level (SPL) at the microphone
located 6 mm (1/4 inch) away from the mouth.

*The narrowband LPC amplitude parameter is the root-mean-square value of the … It is expressed
as an integer between 0 and 512.


where e(i) is the modified unvoiced excitation signal, n(i) is the conventional unvoiced excitation
signal having one unit of RMS value, and p(i) is the pulse train yet to be discussed. The quantity R, a
factor proportional to the speech RMS ratio discussed in the preceding section, is updated at each
frame. Note that the superposition of a pulse train onto the conventional excitation signal does not
make the synthesized speech any louder, even if R is greater than zero, because the synthesized speech
amplitude is calibrated by the same speech RMS value regardless of the nature of the excitation signal.


The random spike component of the modified unvoiced excitation signal is dominant only at the
onsets of unvoiced stops, and then usually for a single isolated frame (Fig. 14). Since the human ear
cannot accurately analyze the turbulent speech waveform over such a short period of time, the exact
nature and location of the spikes are not critical. After examining numerous residual samples
from unvoiced stops and conducting listening tests with synthesized stops, we decided to use four
randomly spaced spikes per frame (Fig. 15).


[Figure 14 panels: onset of CAN; onset of COURSE; OUT. *Amplified 4 times for larger display.]

Fig. 14 - Three examples of unvoiced plosives and their prediction residuals. Note the large spikes in the
prediction residual at the onsets. Without those spikes, the plosives often sound more like fricatives.

[Figure 15 panels: time waveform (180 samples) and amplitude spectrum (frequency in kHz, 0 to 4) for
several amplitudes R of the random pulse in Eq. (19).]

Fig. 15 - Our unvoiced excitation signals and their amplitude spectra. The presence of spikes in our
unvoiced excitation signal improves the production of plosives. The quantity R is related to the speech
RMS ratio across two adjacent unvoiced frames. When R is zero, the resulting waveform is the
conventional unvoiced excitation signal. The amplitude spectrum of our unvoiced excitation signal does
not show any undesirable resonant frequencies.

We observed that the greater the jump in speech RMS between two adjacent unvoiced frames, the
greater the amplitude of the prediction residual spikes. Therefore we made the amplitude of each
pulse, denoted by R in Eq. (19), proportional to the speech RMS ratio. As defined previously, R = 1
implies that each pulse amplitude is equal to the RMS value of the random component n(i) in Eq.
(19). Figure 15 shows that when R = 6 the resulting spike amplitude is sufficient for even the most
distinctive stop bursts, whose RMS ratios are around 25 (see Table 5). Therefore a reasonable value for
R is

$$R = \frac{\text{Speech RMS Ratio}}{4} \tag{20}$$

where R is limited to a minimum of 0 and a maximum of 6. The pulses are spaced randomly so that
they do not introduce harmonically related frequencies similar to pitch or formant frequencies.

The strong unvoiced plosive bursts produced by our modified unvoiced excitation signal can easily
be seen in Fig. 16(b). When compared to the output of the conventional LPC (Fig. 16(c)) it is clear
that the burst information present in the original speech (Fig. 16(a)) has been reproduced much more
accurately by our unvoiced excitation signal. This results in clean, sharp plosive onsets and noticeably
improves the intelligibility of these sounds: COURSE no longer sounds like HORSE, nor PEN like
HEN.
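
The excitation model of Eqs. (19) and (20) can be sketched for a single frame as follows. The frame length of 180 samples, the random pulse polarity, the use of uniformly distributed noise for n(i), and all names are assumptions made for illustration; the subsequent calibration of the synthesized speech amplitude is not modeled.

```python
import math
import random

FRAME_LEN = 180    # samples per frame (assumed)
NUM_PULSES = 4     # randomly spaced spikes per frame

def pulse_gain(rms_ratio):
    """R = (speech RMS ratio) / 4, limited to the range 0 to 6."""
    return min(6.0, max(0.0, rms_ratio / 4.0))

def modified_unvoiced_excitation(rms_ratio, rng, frame_len=FRAME_LEN):
    """One frame of e(i) = n(i) + R * p(i)."""
    # n(i): uniformly distributed noise scaled to unit RMS.
    noise = [rng.uniform(-1.0, 1.0) for _ in range(frame_len)]
    rms = math.sqrt(sum(x * x for x in noise) / frame_len)
    noise = [x / rms for x in noise]
    # p(i): unit pulses at distinct random positions; the random polarity is an assumption.
    R = pulse_gain(rms_ratio)
    pulses = [0.0] * frame_len
    for pos in rng.sample(range(frame_len), NUM_PULSES):
        pulses[pos] = rng.choice((-1.0, 1.0))
    return [n + R * p for n, p in zip(noise, pulses)]

fricative_frame = modified_unvoiced_excitation(2.0, random.Random(1))   # R = 0.5
plosive_frame = modified_unvoiced_excitation(25.0, random.Random(1))    # R limited to 6
```

Because n(i) has unit RMS, R directly expresses each spike height in units of the noise RMS, so a ratio of 25 yields spikes six times the noise level while a ratio near 2 leaves the frame close to conventional noise.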


[Figure 16 panels: (a) original speech; (b) narrowband LPC output with our unvoiced excitation signal;
(c) narrowband LPC output with conventional unvoiced excitation signal.]

Fig. 16 - Spectrograms of narrowband LPC input and output. When our unvoiced excitation is
used, the onsets of CAN, TAKE, and COURSE are reproduced better at the narrowband LPC output.
Note the sudden bursts of speech energy at these onsets in Fig. 16(b) and compare them with
those in Fig. 16(c).


Test  and  Evaluation 

Our modified unvoiced excitation signal was developed to improve reproduction of unvoiced
speech, in particular unvoiced plosives. The DRT is an excellent means for evaluating this
improvement because it specifically tests the intelligibility of initial consonants including unvoiced
plosives. We selected female speakers for the testing because the performance of the narrowband LPC is
notoriously poorer with female voices than with male voices (average DRT scores are about 5.5 points lower).

Table 7 lists DRT scores for three female speakers using the narrowband LPC with the
conventional unvoiced excitation signal and with our modified unvoiced excitation signal. The improvement
for the attribute "graveness" is highly significant. A look at the score changes for the features within
graveness reveals that this improvement is due primarily to better reproduction of unvoiced sounds,
particularly plosives.

Table 8 lists the four features within graveness and the test words associated with each feature.
When the attribute graveness is present, the loci of the second and third formants are relatively low;
when this attribute is absent, they are relatively high. In both cases our unvoiced excitation signal
produces higher scores for all sounds, particularly unvoiced plosives.


Table 7 - DRT scores of narrowband LPC-processed speech
for three females. The first set of scores was obtained using
the conventional unvoiced excitation signal; the second set
was obtained using our unvoiced excitation signal. Note the
significant difference in the score for graveness, which tests
/p/ vs /t/, among others.

With the conventional LPC the tendency on the DRT is for listeners to mistake unvoiced stop
consonants for the voiced ones because the bursts are not reproduced well. The improved burst reproduction
with the modified unvoiced excitation signal reverses this tendency: the voiced sounds are
instead  mistaken  for  unvoiced.  This  may  be  largely  due  to  the  fact  that  many  of  the  plosive  consonants 
on  the  original  tape  were  articulated  directly  into  the  microphone,  thus  overemphasizing  the  bursts. 
Since the bursts of voiced stops are normally weaker than those of unvoiced stops, more faithful reproduction
of these overly strong voiced bursts led listeners to mistakenly identify them as unvoiced. This
tendency  accounts  for  much  of  the  drop  in  the  "voicing"  attribute  score,  and  is  consistent  with  the 
improvements  produced  by  our  modified  unvoiced  excitation  signal. 


EXPANDED  OUTPUT  BANDWIDTH 

Since  the  investigation  of  the  vocoder  by  Dudley  in  1939,  all  vocoders  have  been  implemented 
with  the  input  and  output  bandwidths  equal,  and  more  or  less  confined  to  4  kHz  and  below.  This  has 
also  been  true  in  the  development  of  digitally  implemented  voice  processors  such  as  the  narrowband 
LPC.  The  limited  bandwidth,  combined  with  spectral  distortions  caused  by  the  low  data-rate  encoding, 
makes the synthesized speech sound rather muffled, particularly for unvoiced fricatives and stop consonants.
We introduce a method of expanding the bandwidth of the synthesized speech to 6 kHz by
folding  the  frequency  contents  between  2  and  4  kHz  upward  around  the  cutoff  frequency  of  4  kHz. 

Reasons  for  Output  Bandwidth  Expansion 

The primary reason for expanding the narrowband LPC output bandwidth is to allow more realistic
reproduction of unvoiced speech sounds, particularly stop consonants and voiceless fricatives. We
know  from  the  spectrograms  of  unprocessed  speech  that  the  spectra  of  these  sounds  often  extend  to 
6  kHz  or  beyond.  We  also  know  that  there  is  little  distinctive  formant  information  in  these  sounds,  so 
that  the  spectrum  between  2  and  4  kHz  is  similar  to  that  between  4  and  6  kHz.  Thus,  by  folding  the 
frequency contents between 2 and 4 kHz upward into the region between 4 and 6 kHz, we can make
the  spread  of  the  synthesized  speech  similar  to  that  of  the  original  speech.  The  presence  of  the  higher 
frequencies  makes  stop  consonants  sound  sharper  and  makes  voiceless  fricatives  sound  more  hissy. 

The  output  bandwidth  expansion  also  enhances  the  reproduction  of  voiceless  fricatives  whose 
spectra  were  originally  above  the  passband  of  the  LPC,  but  which  were  brought  down  within  the 
passband  by  the  selectively  applied  aliasing  process  described  as  part  of  our  LPC  analysis  improvements 
[1]. The sound quality will be improved because the output bandwidth expansion operation is the complement
of the aliasing process.

The  output  bandwidth  expansion  also  allows  the  use  of  an  output  low-pass  filter  which  cuts  off 
more  gently  than  that  of  the  conventional  narrowband  LPC.  If  the  low-pass  filter  cutoff  is  too  sharp  (in 
excess  of  100  dB/octave),  the  unvoiced  fricative  tends  to  whistle  because  the  cutoff  frequency  behaves 
as  a  resonant  frequency.  (Note  that  a  sharp  cutoff  low-pass  filter  is  never  used  in  the  playback  of  noisy 
78  RPM  acoustic  records.)  With  the  output  bandwidth  expansion,  the  output  low-pass  filter  may 
decrease  gradually  from  -3  dB  at  4  kHz  to  -60  dB  at  8  kHz. 

The effect of the output bandwidth expansion on voiced speech is of interest, too. Unlike voiceless
speech, voiced speech usually does contain formant information between 2 and 4 kHz which is
reflected into the frequency range between 4 and 6 kHz by the output bandwidth expansion process.
For a majority of voices, however, the intensities of the reflected formants are weak, as will be illustrated
later. Even for voices with strong upper formant frequencies, the presence of the reflected formants
does not affect the speech intelligibility. In fact it tends to make the synthesized speech brighter,
somewhat  akin  to  the  extraneous  formant  frequencies  of  the  singing  voice  [22],  often  called  "singers’ 
formants." 

Finally, the expansion of the output bandwidth from 150-4000 Hz to 150-6000 Hz brings further
improvement  in  the  speech  quality  by  shifting  the  tonal  centroid  of  the  LPC  processed  speech  to  a 
more favorable location. We feel that the conventional narrowband LPC processed speech sounds
rather "bass heavy," even though the frequency components below 150 Hz are attenuated at the rate of
18 dB/octave. We do not get a similar feeling with unprocessed speech, even though none of the low-frequency
components are attenuated. One explanation for this effect is that the tonal centroid of the
unprocessed  speech  is  located  at  a  higher  frequency  because  the  bandwidth  extends  to  10  kHz  or  above. 
Similarly the expansion of the output bandwidth in the LPC helps raise the tonal centroid of the processed
speech. Note that raising the lower cutoff frequency produces a similar perceptual effect; for this
reason  the  lower  cutoff  frequency  for  the  telephone  is  often  as  high  as  300  Hz.  However,  the  lower 
cutoff  frequency  of  the  narrowband  LPC  cannot  be  much  higher  than  the  current  150  Hz  because  both 
the  pitch  tracking  and  the  voicing  decision  would  suffer  greatly  when  another  narrowband  LPC  is 
operated  in  tandem,  as  often  happens  in  military  communication  setups. 
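
As a rough illustration of the centroid argument, the sketch below assumes that the "tonal centroid" can be approximated by the amplitude-weighted mean frequency and that the spectrum is flat across the passband. Both are simplifying assumptions made only for illustration (real speech spectra are far from flat), and the function names are hypothetical.

```python
def spectral_centroid(freqs_hz, amplitudes):
    """Amplitude-weighted mean frequency (an assumed stand-in for 'tonal centroid')."""
    total = sum(amplitudes)
    if total == 0:
        raise ValueError("spectrum has no energy")
    return sum(f * a for f, a in zip(freqs_hz, amplitudes)) / total


def flat_band_centroid(low_hz, high_hz, step_hz=10):
    """Centroid of a hypothetical flat spectrum sampled every step_hz from low_hz to high_hz."""
    freqs = list(range(low_hz, high_hz + 1, step_hz))
    return spectral_centroid(freqs, [1.0] * len(freqs))


conventional = flat_band_centroid(150, 4000)   # conventional LPC passband
expanded = flat_band_centroid(150, 6000)       # expanded output passband
raised_cutoff = flat_band_centroid(300, 4000)  # telephone-like lower cutoff
print(conventional, expanded, raised_cutoff)   # 2075.0 3075.0 2150.0
```

Under these assumptions, extending the upper band edge moves the centroid far more than raising the lower cutoff from 150 to 300 Hz does, although both changes shift it in the same direction.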

Output  Bandwidth  Expansion  Process 

The output bandwidth expansion process is a two-step postsynthesis operation on the speech samples
synthesized by the narrowband LPC. The two steps required are (a) the spectral folding process at
double  the  sampling  rate  and  (b)  a  low-pass  filtering  operation.  Each  is  discussed  below. 


Spectral  Folding  Process 

The  spectral  folding  process  simply  involves  adding  a  zero  between  every  pair  of  adjacent  samples 
in  the  output  digital  waveform.  The  input  and  output  of  the  spectral  folding  process  are  depicted  in 
Figs.  17(a)  and  17(b)  respectively.  As  noted,  the  sampling  frequency  is  increased  by  a  factor  of  2. 
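
The zero-insertion step can be sketched in a few lines. The example below is a minimal illustration, not the real-time implementation: it uses a hypothetical 3 kHz test tone, keeps a trailing zero after the last sample so that the length exactly doubles, and uses a direct DFT only to show where the spectral energy lands. All names are assumptions.

```python
import cmath
import math


def spectral_fold(samples):
    """Insert a zero after every sample, doubling the sampling rate."""
    folded = []
    for s in samples:
        folded.append(s)
        folded.append(0.0)
    return folded


def dft_magnitude(samples, k):
    """Magnitude of the k-th DFT bin, normalized by the number of samples."""
    n = len(samples)
    acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
    return abs(acc) / n


FS_IN = 8000      # conventional LPC output rate, Hz
TONE_HZ = 3000    # test tone inside the 2 to 4 kHz band
N = 80            # 10 ms of samples; 3000 Hz falls exactly on a DFT bin

tone = [math.cos(2 * math.pi * TONE_HZ * i / FS_IN) for i in range(N)]
folded = spectral_fold(tone)          # 160 samples at 16 kHz
bin_hz = (2 * FS_IN) / len(folded)    # 100 Hz per bin

# Frequencies from 0 to 8 kHz that carry significant energy after folding.
peaks = [round(k * bin_hz) for k in range(len(folded) // 2 + 1)
         if dft_magnitude(folded, k) > 0.1]
print(peaks)  # [3000, 5000]
```

The original 3 kHz component is retained and its mirror image appears at 8000 - 3000 = 5000 Hz, which is the folding around 4 kHz described above.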


Fig. 17 - Input and output of the spectral folding process in sampled-data form: (a) input
(conventional narrowband LPC output); (b) output (our narrowband LPC output prior to low-pass
filtering). Note that the sampling rate is doubled at the output.


The sampled data representation of the conventional LPC output is expressed in the form

$$X(z) = \sum_{n=-\infty}^{\infty} x(nT)\, z^{-n} \qquad (21)$$

where $x(nT)$ is the nth sampled value of the synthesized speech, $T$ is the sampling time interval (125
μs for the conventional narrowband LPC using an 8 kHz sampling rate), and $z^{-1}$ is the shifting operation
by one sampling time interval. The spectrum of the sampled signal is the sum of the signal spectrum
from 0 to 1/2T Hz and the shifted complementary spectra at every multiple of 1/T Hz.

The sampled data representation of the output of the spectral folding process may be expressed by


Within the expanded passband of 0 to 1/T (i.e., 0 to 8 kHz), the signal spectrum from 0 to 1/2T (0 to
4 kHz) is contributed by the first term in Eq. (23), and the folded spectrum from 1/2T to 1/T (4 to 8
kHz) is contributed by the second term.
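
As a sketch of how a two-term expression of this kind arises, the standard zero-insertion derivation is outlined below. The notation is assumed here for illustration and is not necessarily the form of Eqs. (22) and (23): $w$ denotes the unit-delay variable at the doubled rate (sampling interval $T/2$), $y$ is the zero-inserted sequence, and $u$ is any hypothetical sequence at the doubled rate, band-limited to $1/2T$, that agrees with $x$ at the even-numbered samples.

```latex
% Zero insertion: even-numbered samples carry x, odd-numbered samples are zero.
y(mT/2) =
\begin{cases}
  x(nT), & m = 2n \\
  0,     & m \text{ odd}
\end{cases}
\qquad\Longrightarrow\qquad
Y(w) = \sum_{n=-\infty}^{\infty} x(nT)\, w^{-2n} = X(w^{2})

% Equivalent two-term form, using y(mT/2) = (1/2)[1 + (-1)^m] u(mT/2):
Y(w) = \tfrac{1}{2}\, U(w) + \tfrac{1}{2}\, U(-w)
```

On the unit circle, $w = e^{j\pi f T}$, so replacing $w$ by $-w$ shifts the spectrum by $1/T$. The first term then occupies 0 to $1/2T$, and the second supplies its mirror image between $1/2T$ and $1/T$, so a component at frequency $f$ reappears at $1/T - f$.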

Low-Pass  Filtering 

Low-pass filtering may be accomplished by a single analog filter at the output of the digital-to-analog (D-A)
converter,  or  by  a  combination  of  a  digital  filter  prior  to  the  D-A  converter  and  a  less  stringently 
designed  analog  filter  at  the  output  of  the  D-A  converter.  We  used  the  first  approach  in  a  real-time 
implementation by using the existing LPC processor, and the second approach in a nonreal-time simulation.
The filter characteristics are not too critical, but we recommend that the attenuations at 4 and 8
kHz  should  be  about  3  and  60  dB,  respectively. 
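
When a digital filter is used ahead of the D-A converter, a candidate design can be checked against the recommended attenuation points by evaluating its frequency response at the doubled sampling rate. The sketch below is only an illustration: the two-tap average is a hypothetical candidate chosen because its response, |H(f)| = cos(πf/fs), is easy to verify by hand. It is not the filter used in our implementation.

```python
import cmath
import math


def fir_gain_db(taps, freq_hz, fs_hz):
    """Gain in dB of an FIR filter with the given taps at freq_hz, for sampling rate fs_hz."""
    omega = 2 * math.pi * freq_hz / fs_hz
    response = sum(h * cmath.exp(-1j * omega * k) for k, h in enumerate(taps))
    magnitude = max(abs(response), 1e-12)  # floor avoids log10(0) at an exact null
    return 20 * math.log10(magnitude)


FS_OUT = 16000       # sampling rate after spectral folding, Hz
taps = [0.5, 0.5]    # hypothetical candidate: two-tap average

for f in (0, 4000, 6000, 8000):
    print(f, round(fir_gain_db(taps, f, FS_OUT), 1))
```

This candidate gives about 3 dB of attenuation at 4 kHz and a null at 8 kHz, but only about 8 dB at 6 kHz, so it rolls off very gently between the two recommended points.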

Spectrographic Analyses of Narrowband LPC Output

This  output  bandwidth  expansion  process  has  been  incorporated  in  our  conventional  narrowband 
LPC  operating  in  real  time.  We  used  an  analog  filter  to  suppress  the  frequency  contents  above  6  kHz, 
since  the  computational  time  available  from  the  processor  was  insufficient  for  digital  filtering.  Figure  18 
shows  spectrographic  analyses  of  a  female  voice  before  and  after  LPC  analysis  and  synthesis.  As  noted, 
the  fricative  sounds  at  the  end  of  THOSE  and  the  beginning  of  CHILDREN  are  reproduced  beyond  the 
4  kHz  passband  of  the  LPC.  The  resulting  sound  quality  is  noticeably  closer  to  that  of  the  unprocessed 
speech.  Even  the  small  burst  waveform  at  the  onset  of  DIRTY  has  been  reproduced  with  an  expanded 
bandwidth  as  in  the  unprocessed  speech.  Figure  19  presents  similar  spectrographic  analyses  of  a  male 
voice. 

Figure  20  is  the  narrowband  spectrographic  analysis  of  the  same  female  voice  shown  in  Fig.  18. 
Note  that  the  pitch  harmonics  are  evenly  spaced  in  the  frequency  range  above  4  kHz,  indicating  that  no 
audible  distortions  are  created  by  noninteger  pitch  harmonics. 
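
The even spacing follows directly from the mirroring. The sketch below assumes a hypothetical pitch of 210 Hz and an idealized harmonic series. It shows that the mirrored harmonics keep the pitch spacing even though they are not, in general, integer multiples of the pitch.

```python
def folded_harmonics(pitch_hz, fold_hz=4000, low_hz=2000):
    """Mirror the pitch harmonics lying between low_hz and fold_hz around fold_hz."""
    harmonics = []
    k = 1
    while k * pitch_hz <= fold_hz:
        if k * pitch_hz >= low_hz:
            harmonics.append(2 * fold_hz - k * pitch_hz)
        k += 1
    return sorted(harmonics)


PITCH_HZ = 210  # hypothetical female pitch, chosen only for illustration
images = folded_harmonics(PITCH_HZ)

# Spacing between neighboring mirrored harmonics, and their offset from
# the nearest lower integer multiple of the pitch.
spacings = {b - a for a, b in zip(images, images[1:])}
offsets = {f % PITCH_HZ for f in images}
print(images[:3], spacings, offsets)  # [4010, 4220, 4430] {210} {20}
```

Every mirrored harmonic is displaced by the same 20 Hz from an integer multiple of the pitch, so the line structure above 4 kHz remains uniform at the 210 Hz spacing.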

Test  and  Evaluation 

The  use  of  the  extended  output  bandwidth  makes  the  synthesized  LPC  speech  noticeably  brighter, 
less  muffled,  and  more  pleasant  to  listen  to.  To  evaluate  this  improvement  quantitatively,  we  turned 
once  again  to  the  DAM  test.  The  results  show  a  2.5-point  increase  in  the  overall  quality  score  from 
46.7 for the conventional LPC to 49.2 with the extended output bandwidth (3 male and 3 female speakers).

A  DRT  was  also  run  to  evaluate  the  effect  of  the  extended  output  bandwidth  on  the  intelligibility 
of  the  LPC  speech.  As  we  expected,  the  DRT  score  for  our  modified  LPC  was  virtually  identical  to 
that  of  the  conventional  LPC,  indicating  that  this  modification  does  improve  the  speech  quality  with  no 
adverse  effect  on  the  speech  intelligibility. 


(b) Our narrowband LPC output

Fig. 18 - Spectrographic analysis of original female speech and the output of the narrowband LPC with our
bandwidth expansion. Note that the fricative spectra are spread beyond the passband of the conventional
narrowband LPC. Since their spectral distributions are more similar to those of the original speech, they
sound more natural.


Fig. 19 - Spectrographic analysis of original male speech and output of the narrowband LPC with our
bandwidth expansion: (a) original speech; (b) our narrowband LPC output. Axis label: FREQUENCY (kHz);
utterance label: LETS TALK.


Fig. 20 - Narrowband spectrographic analysis of the LPC output with our bandwidth expansion
(female speech). Evenly spaced pitch harmonics indicate that the output bandwidth expansion
process does not introduce pitch deformations. The wideband spectrographic analysis
of the same speech is shown in Fig. 18(b).




CONCLUSIONS 

The  objective  of  this  effort  was  to  improve  the  narrowband  LPC  speech  without  compromising  the 
existing DoD interoperability requirements on the speech sampling rate, the frame rate, and the parameter
coding formats. These requirements are expected to remain unaltered for many years. Thus, it is
essential  to  work  within  these  constraints  so  that  any  useful  results  from  these  research  efforts  will 
benefit  the  Navy  and  DoD  in  general. 

Since  the  narrowband  LPC  transmits  the  speech  at  a  low  bit  rate  (less  than  5%  of  the  original 
speech transmission rate), some of the speech parameters, particularly those of the excitation
source, cannot be transmitted and are introduced at the receiver as fixed parameters. The major weakness
of the narrowband LPC synthesizer lies in the use of fixed parameters which do not reflect the
changing  nature  of  human  speech.  We  have  modified  the  amplitude  and  phase  spectra  of  the  voiced 
excitation  signal,  as  well  as  the  temporal  characteristics  of  the  unvoiced  excitation,  to  simulate  some  of 
these  natural  irregularities. 

Though  these  modifications  can  be  implemented  independently,  the  greatest  benefit  is  obtained 
when these synthesis improvements are combined with the analysis improvements we presented in an
earlier report [1]. The speech quality and intelligibility of the resulting narrowband LPC will be nearly
comparable  to  that  of  a  voice  processor  operating  at  four  times  the  data  rate  of  the  LPC. 

After  nearly  a  decade  of  research  and  development,  the  narrowband  LPC  is  now  a  practical  means 
for digitizing speech at low bit rates and is becoming widely deployed in military platforms and communication
centers. Our efforts, and similar efforts by other investigators, will help make the narrowband
LPC more acceptable to general users.